1. Dataset¶

  • source: https://www.kaggle.com/datasets/quangvinhhuynh/marketing-and-retail-analyst-e-comerce

orders¶

  • order_id(PK): 주문의 고유 식별자, 이 테이블의 기본 키 역할
  • customer_id: 고객의 고유 식별자
  • order_status: 주문 상태. 예: 배송됨, 취소됨, 처리 중 등
  • order_purchase_timestamp: 고객이 주문을 한 시점의 타임스탬프
  • order_approved_at: 판매자가 주문을 승인한 시점의 타임스탬프
  • order_delivered_timestamp: 고객 위치에 주문이 배송된 시점의 타임스탬프
  • order_estimated_delivery_date: 주문할 때 고객에게 제공된 예상 배송 날짜

order_items¶

  • order_id(PK): 주문의 고유 식별자
  • order_item_id(PK): 각 주문 내 항목 번호. 이 컬럼과 함께 order_id가 이 테이블의 기본 키 역할
  • product_id: 제품의 고유 식별자
  • seller_id: 판매자의 고유 식별자
  • price: 제품의 판매 가격
  • shipping_charges: 제품의 배송에 관련된 비용

customers¶

  • customer_id(PK): 고객의 고유 식별자, 이 테이블의 기본 키 역할
  • customer_zip_code_prefix: 고객의 우편번호
  • customer_city: 고객의 도시
  • customer_state: 고객의 주

payments¶

  • order_id: 주문의 고유 식별자, 이 테이블에서 이 컬럼은 중복될 수 있다
  • payment_sequential: 주어진 주문에 대한 결제 순서 정보를 제공
  • payment_type: 결제 유형 예: 신용카드, 직불카드 등
  • payment_installments: 신용카드 결제 시 할부 회차
  • payment_value: 거래 금액

products¶

  • product_id: 각 제품의 고유 식별자, 이 테이블의 기본 키 역할
  • product_category_name: 제품이 속한 카테고리 이름
  • product_weight_g: 제품 무게 (그램)
  • product_length_cm: 제품 길이 (센티미터)
  • product_height_cm: 제품 높이 (센티미터)
  • product_width_cm: 제품 너비 (센티미터)

1-1. Data load¶

  • merge_df : 모든 테이블을 하나의 데이테세으로 만든 데이터
In [1]:
# 필요한 모듈들 모두 이 셀에서 관리

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
In [2]:
# 데이터 불러오기
import pandas as pd
customers_df = pd.read_csv("D:/bootcamp/machine_learning/project/customers.csv") #99441
order_items_df = pd.read_csv("D:/bootcamp/machine_learning/project/order_items.csv") #112650
orders_df = pd.read_csv("D:/bootcamp/machine_learning/project/orders.csv") #99441
payments_df = pd.read_csv("D:/bootcamp/machine_learning/project/payments.csv") #103886
products_df = pd.read_csv("D:/bootcamp/machine_learning/project/products.csv") #32951
In [3]:
customers_df = customers_df.drop_duplicates()

# orders_df와 customers_df를 customer_id를 기준으로 병합
merged_df = pd.merge(orders_df, customers_df, on='customer_id', how='left')

# merged_df와 order_items_df를 order_id를 기준으로 병합
merged_df = pd.merge(merged_df, order_items_df, on='order_id', how='left')

# merged_df와 payments_df를 order_id를 기준으로 병합
merged_df = pd.merge(merged_df, payments_df, on='order_id', how='left')

# merged_df와 products_df를 product_id를 기준으로 병합
final_df = pd.merge(merged_df, products_df, on='product_id', how='left')

# 결과를 CSV 파일로 저장
# final_df.to_csv("merged_data_final.csv", index=False)
merge_df = final_df
In [4]:
merge_df.head(5)
Out[4]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
0 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP ... 8.72 1.0 credit_card 1.0 18.12 housewares 500.0 19.0 8.0 13.0
1 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP ... 8.72 3.0 voucher 1.0 2.00 housewares 500.0 19.0 8.0 13.0
2 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP ... 8.72 2.0 voucher 1.0 18.59 housewares 500.0 19.0 8.0 13.0
3 53cdb2fc8bc7dce0b6741e2150273451 af07308b275d755c9edb36a90c618231 delivered 7/24/2018 20:41 7/26/2018 3:24 8/7/2018 15:27 8/13/2018 0:00 47813 barreiras BA ... 22.76 1.0 wallet 1.0 141.46 toys 400.0 19.0 13.0 19.0
4 47770eb9100c2d0c44946d9cf07ec65d 3a653a41f6f9fc3d2a113cf8398680e8 delivered 8/8/2018 8:38 8/8/2018 8:55 8/17/2018 18:06 9/4/2018 0:00 75265 vianopolis GO ... 19.22 1.0 credit_card 3.0 179.12 toys 420.0 24.0 19.0 21.0

5 rows × 24 columns

2. Exploratory Data Analysis(EDA)¶

2-1. 데이터 간단히 살펴보기¶

In [5]:
# 데이터 정보 요약 : info()

merge_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119160 entries, 0 to 119159
Data columns (total 24 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       119160 non-null  object 
 1   customer_id                    119160 non-null  object 
 2   order_status                   119160 non-null  object 
 3   order_purchase_timestamp       119160 non-null  object 
 4   order_approved_at              118982 non-null  object 
 5   order_delivered_timestamp      115738 non-null  object 
 6   order_estimated_delivery_date  119160 non-null  object 
 7   customer_zip_code_prefix       119160 non-null  int64  
 8   customer_city                  119160 non-null  object 
 9   customer_state                 119160 non-null  object 
 10  order_item_id                  118325 non-null  float64
 11  product_id                     118325 non-null  object 
 12  seller_id                      118325 non-null  object 
 13  price                          118325 non-null  float64
 14  shipping_charges               118325 non-null  float64
 15  payment_sequential             119157 non-null  float64
 16  payment_type                   119157 non-null  object 
 17  payment_installments           119157 non-null  float64
 18  payment_value                  119157 non-null  float64
 19  product_category_name          117893 non-null  object 
 20  product_weight_g               118305 non-null  float64
 21  product_length_cm              118305 non-null  float64
 22  product_height_cm              118305 non-null  float64
 23  product_width_cm               118305 non-null  float64
dtypes: float64(10), int64(1), object(13)
memory usage: 21.8+ MB
In [6]:
# 데이터 기술통계량 : describe()

merge_df.describe(include='all')
Out[6]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
count 119160 119160 119160 119160 118982 115738 119160 119160.000000 119160 119160 ... 118325.000000 119157.000000 119157 119157.000000 119157.000000 117893 118305.000000 118305.000000 118305.000000 118305.000000
unique 99441 96096 8 88789 50462 75649 459 NaN 4119 27 ... NaN NaN 5 NaN NaN 70 NaN NaN NaN NaN
top 895ab968e7bb0d5659d16cd74cd1650c 9a736b248f67d166d2fbb006bcb877c3 delivered 8/8/2017 20:26 1/10/2018 10:32 8/14/2017 12:46 12/20/2017 0:00 NaN sao paulo SP ... NaN NaN credit_card NaN NaN toys NaN NaN NaN NaN
freq 63 75 115739 63 121 63 658 NaN 18864 50305 ... NaN NaN 87795 NaN NaN 88791 NaN NaN NaN NaN
mean NaN NaN NaN NaN NaN NaN NaN 35000.842204 NaN NaN ... 20.037466 1.094178 NaN 2.941053 172.825878 NaN 2113.957364 30.254537 16.633143 23.066692
std NaN NaN NaN NaN NaN NaN NaN 29806.174149 NaN NaN ... 15.847918 0.726675 NaN 2.777111 267.813648 NaN 3789.273735 16.189991 13.453618 11.743808
min NaN NaN NaN NaN NaN NaN NaN 1003.000000 NaN NaN ... 0.000000 1.000000 NaN 0.000000 0.000000 NaN 0.000000 7.000000 2.000000 6.000000
25% NaN NaN NaN NaN NaN NaN NaN 11250.000000 NaN NaN ... 13.080000 1.000000 NaN 1.000000 60.890000 NaN 300.000000 18.000000 8.000000 15.000000
50% NaN NaN NaN NaN NaN NaN NaN 24240.000000 NaN NaN ... 16.280000 1.000000 NaN 2.000000 108.210000 NaN 700.000000 25.000000 13.000000 20.000000
75% NaN NaN NaN NaN NaN NaN NaN 58415.000000 NaN NaN ... 21.190000 1.000000 NaN 4.000000 189.260000 NaN 1800.000000 38.000000 20.000000 30.000000
max NaN NaN NaN NaN NaN NaN NaN 99990.000000 NaN NaN ... 409.680000 29.000000 NaN 24.000000 13664.080000 NaN 40425.000000 105.000000 105.000000 118.000000

11 rows × 24 columns

In [7]:
# 결측치 확인

merge_df.isna().sum()
Out[7]:
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 178
order_delivered_timestamp        3422
order_estimated_delivery_date       0
customer_zip_code_prefix            0
customer_city                       0
customer_state                      0
order_item_id                     835
product_id                        835
seller_id                         835
price                             835
shipping_charges                  835
payment_sequential                  3
payment_type                        3
payment_installments                3
payment_value                       3
product_category_name            1267
product_weight_g                  855
product_length_cm                 855
product_height_cm                 855
product_width_cm                  855
dtype: int64
In [8]:
# 결측치를 포함하는 column

missing_columns = merge_df.isna().sum()[merge_df.isna().sum() > 0].index
missing_columns
Out[8]:
Index(['order_approved_at', 'order_delivered_timestamp', 'order_item_id',
       'product_id', 'seller_id', 'price', 'shipping_charges',
       'payment_sequential', 'payment_type', 'payment_installments',
       'payment_value', 'product_category_name', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm'],
      dtype='object')

수치형, 범주형 변수 나눠서 데이터 살펴보기

In [9]:
merge_df.iloc[:, 0:13].head(3)
Out[9]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state order_item_id product_id seller_id
0 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP 1.0 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9
1 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP 1.0 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9
2 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP 1.0 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9
In [10]:
merge_df.iloc[:, 13:].head(3)
Out[10]:
price shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
0 29.99 8.72 1.0 credit_card 1.0 18.12 housewares 500.0 19.0 8.0 13.0
1 29.99 8.72 3.0 voucher 1.0 2.00 housewares 500.0 19.0 8.0 13.0
2 29.99 8.72 2.0 voucher 1.0 18.59 housewares 500.0 19.0 8.0 13.0

수치형:

  • order_item_id
  • price
  • shipping_charges
  • payment_sequential : 애매하긴 한데... 일단 수치형
  • payment_installments
  • payment_value
  • product_weight_g, product_length_cm, product_height_cm, product_width_cm

범주형:

  • order_status
  • order_purchase_timestamp, order_approved_at, order_delivered_timestamp, order_estimated_delivery_date : 나중에 분기별로 혹은 달별로 범주를 나눠 사용하면 좋을듯
  • customer_zip_code_prefix, customer_city, customer_state
  • payment_type
  • product_category_name
In [11]:
numerical_cols = ['order_item_id', 'price', 'shipping_charges', 'payment_sequential', 'payment_installments', 'payment_value',
                           'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
categorical_cols = ['order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_timestamp', 
                             'order_estimated_delivery_date', 'customer_zip_code_prefix', 'customer_city', 'customer_state',
                             'payment_type', 'product_category_name']
In [12]:
# 수치형 변수에 대한 value_counts

numerical_df = merge_df[numerical_cols]

for col in numerical_df.columns:
    print(numerical_df[col].value_counts().to_frame().join(numerical_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
                count  proportion
order_item_id                    
1.0            103656    0.876028
2.0             10314    0.963195
3.0              2403    0.983503
4.0               995    0.991912
5.0               473    0.995910
6.0               264    0.998141
7.0                62    0.998665
8.0                37    0.998977
9.0                29    0.999222
10.0               26    0.999442
11.0               17    0.999586
12.0               13    0.999696
13.0                8    0.999763
14.0                7    0.999823
15.0                5    0.999865
16.0                3    0.999890
17.0                3    0.999915
18.0                3    0.999941
19.0                3    0.999966
20.0                3    0.999992
21.0                1    1.000000
==================================================
        count  proportion
price                    
59.90    2628    0.022210
69.90    2116    0.040093
49.90    2049    0.057410
89.90    1632    0.071202
99.90    1531    0.084141
...       ...         ...
424.90      1    0.999966
234.80      1    0.999975
119.95      1    0.999983
107.94      1    0.999992
213.39      1    1.000000

[5968 rows x 2 columns]
==================================================
                  count  proportion
shipping_charges                   
15.10              3856    0.032588
7.78               2353    0.052474
14.10              1989    0.069284
11.85              1982    0.086034
18.23              1631    0.099818
...                 ...         ...
39.37                 1    0.999966
49.03                 1    0.999975
40.72                 1    0.999983
48.10                 1    0.999992
36.89                 1    1.000000

[6999 rows x 2 columns]
==================================================
                     count  proportion
payment_sequential                    
1.0                 114011    0.956813
2.0                   3430    0.985599
3.0                    662    0.991155
4.0                    320    0.993840
5.0                    192    0.995451
6.0                    134    0.996576
7.0                     92    0.997348
8.0                     61    0.997860
9.0                     50    0.998280
10.0                    41    0.998624
11.0                    35    0.998917
12.0                    27    0.999144
13.0                    16    0.999278
14.0                    13    0.999387
15.0                    11    0.999480
16.0                     9    0.999555
18.0                     9    0.999631
17.0                     9    0.999706
19.0                     9    0.999782
21.0                     6    0.999832
20.0                     6    0.999883
22.0                     3    0.999908
25.0                     2    0.999924
26.0                     2    0.999941
23.0                     2    0.999958
24.0                     2    0.999975
27.0                     1    0.999983
29.0                     1    0.999992
28.0                     1    1.000000
==================================================
                      count  proportion
payment_installments                   
1.0                   59438    0.498821
2.0                   13865    0.615180
3.0                   11883    0.714906
4.0                    8073    0.782656
10.0                   6975    0.841193
5.0                    6102    0.892402
8.0                    5109    0.935279
6.0                    4670    0.974471
7.0                    1857    0.990055
9.0                     747    0.996324
12.0                    166    0.997717
15.0                     94    0.998506
18.0                     38    0.998825
24.0                     34    0.999110
11.0                     26    0.999329
20.0                     21    0.999505
13.0                     18    0.999656
14.0                     16    0.999790
17.0                      8    0.999857
16.0                      7    0.999916
21.0                      5    0.999958
0.0                       3    0.999983
23.0                      1    0.999992
22.0                      1    1.000000
==================================================
               count  proportion
payment_value                   
50.00            351    0.002946
100.00           302    0.005480
20.00            289    0.007906
77.57            251    0.010012
35.00            167    0.011414
...              ...         ...
202.55             1    0.999966
268.97             1    0.999975
381.76             1    0.999983
202.53             1    0.999992
182.39             1    1.000000

[29077 rows x 2 columns]
==================================================
                  count  proportion
product_weight_g                   
200.0              7092    0.059947
150.0              5414    0.105710
250.0              4727    0.145666
300.0              4444    0.183230
400.0              3780    0.215181
...                 ...         ...
726.0                 1    0.999966
8575.0                1    0.999975
3598.0                1    0.999983
11025.0               1    0.999992
2676.0                1    1.000000

[2204 rows x 2 columns]
==================================================
                   count  proportion
product_length_cm                   
16.0               18420    0.155699
20.0               10976    0.248476
30.0                7954    0.315709
17.0                6208    0.368184
18.0                5902    0.418072
...                  ...         ...
83.0                   8    0.999831
96.0                   8    0.999899
94.0                   6    0.999949
9.0                    4    0.999983
8.0                    2    1.000000

[99 rows x 2 columns]
==================================================
                   count  proportion
product_height_cm                   
10.0               10375    0.087697
20.0                6935    0.146317
15.0                6880    0.204471
12.0                6533    0.259693
11.0                6427    0.314019
...                  ...         ...
98.0                   3    0.999932
92.0                   3    0.999958
94.0                   2    0.999975
97.0                   2    0.999992
89.0                   1    1.000000

[102 rows x 2 columns]
==================================================
                  count  proportion
product_width_cm                   
20.0              12727    0.107578
11.0              11140    0.201741
15.0               9373    0.280969
16.0               8859    0.355851
30.0               8046    0.423862
...                 ...         ...
103.0                 1    0.999966
97.0                  1    0.999975
104.0                 1    0.999983
98.0                  1    0.999992
86.0                  1    1.000000

[95 rows x 2 columns]
==================================================
In [13]:
# 수치형 변수 기술통계량

numerical_df.describe()
Out[13]:
order_item_id price shipping_charges payment_sequential payment_installments payment_value product_weight_g product_length_cm product_height_cm product_width_cm
count 118325.000000 118325.000000 118325.000000 119157.000000 119157.000000 119157.000000 118305.000000 118305.000000 118305.000000 118305.000000
mean 1.196569 120.729200 20.037466 1.094178 2.941053 172.825878 2113.957364 30.254537 16.633143 23.066692
std 0.699144 184.130919 15.847918 0.726675 2.777111 267.813648 3789.273735 16.189991 13.453618 11.743808
min 1.000000 0.850000 0.000000 1.000000 0.000000 0.000000 0.000000 7.000000 2.000000 6.000000
25% 1.000000 39.900000 13.080000 1.000000 1.000000 60.890000 300.000000 18.000000 8.000000 15.000000
50% 1.000000 74.900000 16.280000 1.000000 2.000000 108.210000 700.000000 25.000000 13.000000 20.000000
75% 1.000000 134.900000 21.190000 1.000000 4.000000 189.260000 1800.000000 38.000000 20.000000 30.000000
max 21.000000 6735.000000 409.680000 29.000000 24.000000 13664.080000 40425.000000 105.000000 105.000000 118.000000
In [14]:
# 범주형 변수에 대한 value_counts

categorical_df = merge_df[categorical_cols]

for col in categorical_df.columns:
    print(categorical_df[col].value_counts().to_frame().join(categorical_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
               count  proportion
order_status                    
delivered     115739    0.971291
shipped         1257    0.981840
canceled         752    0.988150
unavailable      653    0.993630
processing       376    0.996786
invoiced         375    0.999933
created            5    0.999975
approved           3    1.000000
==================================================
                          count  proportion
order_purchase_timestamp                   
8/8/2017 20:26               63    0.000529
9/23/2017 14:56              38    0.000848
8/2/2018 12:06               35    0.001141
8/2/2018 12:05               31    0.001401
4/20/2017 12:45              29    0.001645
...                         ...         ...
8/17/2018 7:44                1    0.999966
12/26/2017 14:16              1    0.999975
7/6/2018 19:14                1    0.999983
10/15/2017 18:32              1    0.999992
3/8/2018 20:57                1    1.000000

[88789 rows x 2 columns]
==================================================
                   count  proportion
order_approved_at                   
1/10/2018 10:32      121    0.001017
12/1/2017 11:31       94    0.001807
7/24/2018 10:31       88    0.002547
11/7/2017 7:30        87    0.003278
2/27/2018 4:31        82    0.003967
...                  ...         ...
8/23/2017 20:15        1    0.999966
7/3/2017 16:30         1    0.999975
7/28/2018 0:25         1    0.999983
10/7/2016 23:13        1    0.999992
9/14/2017 12:30        1    1.000000

[50462 rows x 2 columns]
==================================================
                           count  proportion
order_delivered_timestamp                   
8/14/2017 12:46               63    0.000544
10/18/2017 22:35              38    0.000873
3/5/2018 15:22                30    0.001132
6/22/2017 16:04               27    0.001365
2/28/2018 20:09               26    0.001590
...                          ...         ...
1/31/2018 19:16                1    0.999965
11/8/2017 13:42                1    0.999974
6/5/2018 0:44                  1    0.999983
7/27/2018 19:56                1    0.999991
3/16/2018 13:08                1    1.000000

[75649 rows x 2 columns]
==================================================
                               count  proportion
order_estimated_delivery_date                   
12/20/2017 0:00                  658    0.005522
3/12/2018 0:00                   617    0.010700
3/13/2018 0:00                   616    0.015869
5/29/2018 0:00                   615    0.021031
5/30/2018 0:00                   593    0.026007
...                              ...         ...
11/14/2016 0:00                    1    0.999966
11/7/2016 0:00                     1    0.999975
1/9/2017 0:00                      1    0.999983
10/28/2016 0:00                    1    0.999992
10/27/2016 0:00                    1    1.000000

[459 rows x 2 columns]
==================================================
                          count  proportion
customer_zip_code_prefix                   
24220                       163    0.001368
22790                       156    0.002677
22793                       155    0.003978
24230                       139    0.005144
22775                       132    0.006252
...                         ...         ...
58086                         1    0.999966
68798                         1    0.999975
55365                         1    0.999983
89086                         1    0.999992
45920                         1    1.000000

[14994 rows x 2 columns]
==================================================
                     count  proportion
customer_city                         
sao paulo            18864    0.158308
rio de janeiro        8322    0.228147
belo horizonte        3310    0.255925
brasilia              2488    0.276804
curitiba              1827    0.292137
...                    ...         ...
igrapiuna                1    0.999966
nantes                   1    0.999975
carnauba dos dantas      1    0.999983
satiro dias              1    0.999992
nova vicosa              1    1.000000

[4119 rows x 2 columns]
==================================================
                count  proportion
customer_state                   
SP              50305    0.422163
RJ              15527    0.552467
MG              13826    0.668496
RS               6542    0.723397
PR               6038    0.774068
SC               4355    0.810616
BA               4091    0.844948
DF               2505    0.865970
GO               2461    0.886623
ES               2366    0.906479
PE               1905    0.922466
CE               1567    0.935616
MT               1131    0.945107
PA               1124    0.954540
MA                859    0.961749
MS                850    0.968882
PB                645    0.974295
PI                577    0.979137
RN                574    0.983954
AL                465    0.987857
SE                409    0.991289
TO                340    0.994142
RO                293    0.996601
AM                174    0.998061
AC                 95    0.998859
AP                 84    0.999564
RR                 52    1.000000
==================================================
              count  proportion
payment_type                   
credit_card   87795    0.736801
wallet        23177    0.931309
voucher        6475    0.985649
debit_card     1707    0.999975
not_defined       3    1.000000
==================================================
                                   count  proportion
product_category_name                               
toys                               88791    0.753149
health_beauty                       3143    0.779809
bed_bath_table                      2752    0.803152
sports_leisure                      2404    0.823543
computers_accessories               2275    0.842841
...                                  ...         ...
fashion_childrens_clothes              2    0.999966
diapers_and_hygiene                    1    0.999975
home_comfort_2                         1    0.999983
furniture_mattress_and_upholstery      1    0.999992
security_and_services                  1    1.000000

[70 rows x 2 columns]
==================================================
In [15]:
numerical_df.describe(include='all')
Out[15]:
order_item_id price shipping_charges payment_sequential payment_installments payment_value product_weight_g product_length_cm product_height_cm product_width_cm
count 118325.000000 118325.000000 118325.000000 119157.000000 119157.000000 119157.000000 118305.000000 118305.000000 118305.000000 118305.000000
mean 1.196569 120.729200 20.037466 1.094178 2.941053 172.825878 2113.957364 30.254537 16.633143 23.066692
std 0.699144 184.130919 15.847918 0.726675 2.777111 267.813648 3789.273735 16.189991 13.453618 11.743808
min 1.000000 0.850000 0.000000 1.000000 0.000000 0.000000 0.000000 7.000000 2.000000 6.000000
25% 1.000000 39.900000 13.080000 1.000000 1.000000 60.890000 300.000000 18.000000 8.000000 15.000000
50% 1.000000 74.900000 16.280000 1.000000 2.000000 108.210000 700.000000 25.000000 13.000000 20.000000
75% 1.000000 134.900000 21.190000 1.000000 4.000000 189.260000 1800.000000 38.000000 20.000000 30.000000
max 21.000000 6735.000000 409.680000 29.000000 24.000000 13664.080000 40425.000000 105.000000 105.000000 118.000000

2-2. 데이터 시각화¶

수치형 시각화 방법:

  • 히스토그램 : histplot()
  • 커널밀도추정함수 그래프 : kdeplot()
  • 분포도 : displot()
  • 박스플롯 : boxplot()
  • 페어플롯 : pairplot()

범주형 시각화 방법:

  • 막대그래프 : barplot()
  • 카운트플롯 : countplot()
  • 박스플롯 : boxplot()

데이터 상관관계 시각화 방법: 데이터 전처리 후 볼 예정

  • 히트맵 : heatmpap()
  • 라인플롯 : lineplot()
  • 산점도 : scatterplot()
  • 회귀선을 포함한 산점도 그래프 : regplot()

2-2-1. 수치형 시각화¶

In [16]:
# 수치형 히스토그램
plt.figure(figsize=(20, 10))

for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    ax = sns.histplot(merge_df[col], kde=False)
    plt.title(f'Histogram of {col}')

plt.tight_layout()  
plt.show()
No description has been provided for this image
In [17]:
# 수치형 kdeplot()

plt.figure(figsize=(20, 10))

for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    ax = sns.kdeplot(merge_df[col], shade=True)
    plt.title(f'KDE Plot of {col}')

plt.tight_layout()  
plt.show()
No description has been provided for this image
In [18]:
# 수치형 boxplot

plt.figure(figsize=(20, 10))

for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    sns.boxplot(y=merge_df[col])  
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col) 
    plt.ylabel('Value')

plt.tight_layout()
plt.show()
No description has been provided for this image
In [19]:
# 수치형 pairplot

sns.pairplot(merge_df[numerical_cols], diag_kind='kde')  
plt.suptitle('Pairplot of Numerical Variables', y=1.02)  
plt.show()
No description has been provided for this image

수치형 변수 시각화 해석:

  • 시각화 결과 모든 column에서 데이터의 왼쪽 쏠림 현상 발견.
  • 데이터의 표준화가 필요할 거 같다.
  • 박스플롯을 살펴봤을 때 이상치가 많아 정제가 필요할 거 같다.

2-2-2. 범주형 시각화¶

In [20]:
# 범주형 countplot 상위 10개의 값만 시각화

rows = 5
cols = 2

# 플롯 크기 설정
fig, axes = plt.subplots(rows, cols, figsize=(15, 25))

# 모든 플롯을 차지할 수 있도록 축 배열 평탄화
axes = axes.flatten()

# 각 범주형 변수에 대해 상위 10개 값의 countplot 그리기
for i, col in enumerate(categorical_cols):
    top_10_values = merge_df[col].value_counts().nlargest(10).index
    sns.countplot(data=merge_df[merge_df[col].isin(top_10_values)], x=col, ax=axes[i])
    axes[i].set_title(f'Top 10 Values in {col}')
    axes[i].set_ylabel('Count')
    axes[i].set_xlabel(col)
    axes[i].tick_params(axis='x', rotation=45)

# 레이아웃 조정
plt.tight_layout()
plt.show()
No description has been provided for this image

3. Data Preprocessing¶

3-1. 결측치 처리¶

In [21]:
# 결측치를 포함하는 column

merge_df[missing_columns].isna().sum()
Out[21]:
order_approved_at             178
order_delivered_timestamp    3422
order_item_id                 835
product_id                    835
seller_id                     835
price                         835
shipping_charges              835
payment_sequential              3
payment_type                    3
payment_installments            3
payment_value                   3
product_category_name        1267
product_weight_g              855
product_length_cm             855
product_height_cm             855
product_width_cm              855
dtype: int64
In [22]:
# missing_col_df에 대한 기술통계량

missing_col_df = merge_df[missing_columns]

missing_col_df.describe(include='all')
Out[22]:
order_approved_at order_delivered_timestamp order_item_id product_id seller_id price shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
count 118982 115738 118325.000000 118325 118325 118325.000000 118325.000000 119157.000000 119157 119157.000000 119157.000000 117893 118305.000000 118305.000000 118305.000000 118305.000000
unique 50462 75649 NaN 32951 3095 NaN NaN NaN 5 NaN NaN 70 NaN NaN NaN NaN
top 1/10/2018 10:32 8/14/2017 12:46 NaN aca2eb7d00ea1a7b8ebd4e68314663af 4a3ca9315b744ce9f8e9374361493884 NaN NaN NaN credit_card NaN NaN toys NaN NaN NaN NaN
freq 121 63 NaN 538 2155 NaN NaN NaN 87795 NaN NaN 88791 NaN NaN NaN NaN
mean NaN NaN 1.196569 NaN NaN 120.729200 20.037466 1.094178 NaN 2.941053 172.825878 NaN 2113.957364 30.254537 16.633143 23.066692
std NaN NaN 0.699144 NaN NaN 184.130919 15.847918 0.726675 NaN 2.777111 267.813648 NaN 3789.273735 16.189991 13.453618 11.743808
min NaN NaN 1.000000 NaN NaN 0.850000 0.000000 1.000000 NaN 0.000000 0.000000 NaN 0.000000 7.000000 2.000000 6.000000
25% NaN NaN 1.000000 NaN NaN 39.900000 13.080000 1.000000 NaN 1.000000 60.890000 NaN 300.000000 18.000000 8.000000 15.000000
50% NaN NaN 1.000000 NaN NaN 74.900000 16.280000 1.000000 NaN 2.000000 108.210000 NaN 700.000000 25.000000 13.000000 20.000000
75% NaN NaN 1.000000 NaN NaN 134.900000 21.190000 1.000000 NaN 4.000000 189.260000 NaN 1800.000000 38.000000 20.000000 30.000000
max NaN NaN 21.000000 NaN NaN 6735.000000 409.680000 29.000000 NaN 24.000000 13664.080000 NaN 40425.000000 105.000000 105.000000 118.000000
In [23]:
# missing_columns에 대한 value_counts

for col in missing_col_df.columns:
    print(missing_col_df[col].value_counts().to_frame().join(missing_col_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
                   count  proportion
order_approved_at                   
1/10/2018 10:32      121    0.001017
12/1/2017 11:31       94    0.001807
7/24/2018 10:31       88    0.002547
11/7/2017 7:30        87    0.003278
2/27/2018 4:31        82    0.003967
...                  ...         ...
8/23/2017 20:15        1    0.999966
7/3/2017 16:30         1    0.999975
7/28/2018 0:25         1    0.999983
10/7/2016 23:13        1    0.999992
9/14/2017 12:30        1    1.000000

[50462 rows x 2 columns]
==================================================
                           count  proportion
order_delivered_timestamp                   
8/14/2017 12:46               63    0.000544
10/18/2017 22:35              38    0.000873
3/5/2018 15:22                30    0.001132
6/22/2017 16:04               27    0.001365
2/28/2018 20:09               26    0.001590
...                          ...         ...
1/31/2018 19:16                1    0.999965
11/8/2017 13:42                1    0.999974
6/5/2018 0:44                  1    0.999983
7/27/2018 19:56                1    0.999991
3/16/2018 13:08                1    1.000000

[75649 rows x 2 columns]
==================================================
                count  proportion
order_item_id                    
1.0            103656    0.876028
2.0             10314    0.963195
3.0              2403    0.983503
4.0               995    0.991912
5.0               473    0.995910
6.0               264    0.998141
7.0                62    0.998665
8.0                37    0.998977
9.0                29    0.999222
10.0               26    0.999442
11.0               17    0.999586
12.0               13    0.999696
13.0                8    0.999763
14.0                7    0.999823
15.0                5    0.999865
16.0                3    0.999890
17.0                3    0.999915
18.0                3    0.999941
19.0                3    0.999966
20.0                3    0.999992
21.0                1    1.000000
==================================================
                                  count  proportion
product_id                                         
aca2eb7d00ea1a7b8ebd4e68314663af    538    0.004547
99a4788cb24856965c36a24e339b6058    527    0.009001
422879e10f46682990de24d770e7f83d    506    0.013277
389d119b48cf3043d311335e499d9c6b    407    0.016717
368c6c730842d78016ad823897a372db    400    0.020097
...                                 ...         ...
66ba269ae7ed286b181ebfb76d1d989e      1    0.999966
f4a2e2f1cf569d9cee134511bdd0f969      1    0.999975
5f79ee3b64a3af922735487fb06da429      1    0.999983
6d3b335d1fba472b61c01ba542307f62      1    0.999992
006619bbed68b000c8ba3f8725d5409e      1    1.000000

[32951 rows x 2 columns]
==================================================
                                  count  proportion
seller_id                                          
4a3ca9315b744ce9f8e9374361493884   2155    0.018213
6560211a19b47992c3666cc44a7e94c0   2132    0.036231
1f50f920176fa81dab994f9023523100   2019    0.053294
cc419e0650a3c5ba77189a1882b7556a   1857    0.068988
da8622b14eb17ae2831f4ac5b9dab84a   1646    0.082899
...                                 ...         ...
c74f14c1e26cf1bd5bd34b6a1e274426      1    0.999966
4ce6e5f6c52515177e18c1c9361d8677      1    0.999975
f90f77ef2799a27f80d90c425ca944f7      1    0.999983
c3e2398fcc7e581cda2e546557bf6968      1    0.999992
f3862c2188522d89860c38a3ea8b550d      1    1.000000

[3095 rows x 2 columns]
==================================================
        count  proportion
price                    
59.90    2628    0.022210
69.90    2116    0.040093
49.90    2049    0.057410
89.90    1632    0.071202
99.90    1531    0.084141
...       ...         ...
424.90      1    0.999966
234.80      1    0.999975
119.95      1    0.999983
107.94      1    0.999992
213.39      1    1.000000

[5968 rows x 2 columns]
==================================================
                  count  proportion
shipping_charges                   
15.10              3856    0.032588
7.78               2353    0.052474
14.10              1989    0.069284
11.85              1982    0.086034
18.23              1631    0.099818
...                 ...         ...
39.37                 1    0.999966
49.03                 1    0.999975
40.72                 1    0.999983
48.10                 1    0.999992
36.89                 1    1.000000

[6999 rows x 2 columns]
==================================================
                     count  proportion
payment_sequential                    
1.0                 114011    0.956813
2.0                   3430    0.985599
3.0                    662    0.991155
4.0                    320    0.993840
5.0                    192    0.995451
6.0                    134    0.996576
7.0                     92    0.997348
8.0                     61    0.997860
9.0                     50    0.998280
10.0                    41    0.998624
11.0                    35    0.998917
12.0                    27    0.999144
13.0                    16    0.999278
14.0                    13    0.999387
15.0                    11    0.999480
16.0                     9    0.999555
18.0                     9    0.999631
17.0                     9    0.999706
19.0                     9    0.999782
21.0                     6    0.999832
20.0                     6    0.999883
22.0                     3    0.999908
25.0                     2    0.999924
26.0                     2    0.999941
23.0                     2    0.999958
24.0                     2    0.999975
27.0                     1    0.999983
29.0                     1    0.999992
28.0                     1    1.000000
==================================================
              count  proportion
payment_type                   
credit_card   87795    0.736801
wallet        23177    0.931309
voucher        6475    0.985649
debit_card     1707    0.999975
not_defined       3    1.000000
==================================================
                      count  proportion
payment_installments                   
1.0                   59438    0.498821
2.0                   13865    0.615180
3.0                   11883    0.714906
4.0                    8073    0.782656
10.0                   6975    0.841193
5.0                    6102    0.892402
8.0                    5109    0.935279
6.0                    4670    0.974471
7.0                    1857    0.990055
9.0                     747    0.996324
12.0                    166    0.997717
15.0                     94    0.998506
18.0                     38    0.998825
24.0                     34    0.999110
11.0                     26    0.999329
20.0                     21    0.999505
13.0                     18    0.999656
14.0                     16    0.999790
17.0                      8    0.999857
16.0                      7    0.999916
21.0                      5    0.999958
0.0                       3    0.999983
23.0                      1    0.999992
22.0                      1    1.000000
==================================================
               count  proportion
payment_value                   
50.00            351    0.002946
100.00           302    0.005480
20.00            289    0.007906
77.57            251    0.010012
35.00            167    0.011414
...              ...         ...
202.55             1    0.999966
268.97             1    0.999975
381.76             1    0.999983
202.53             1    0.999992
182.39             1    1.000000

[29077 rows x 2 columns]
==================================================
                                   count  proportion
product_category_name                               
toys                               88791    0.753149
health_beauty                       3143    0.779809
bed_bath_table                      2752    0.803152
sports_leisure                      2404    0.823543
computers_accessories               2275    0.842841
...                                  ...         ...
fashion_childrens_clothes              2    0.999966
diapers_and_hygiene                    1    0.999975
home_comfort_2                         1    0.999983
furniture_mattress_and_upholstery      1    0.999992
security_and_services                  1    1.000000

[70 rows x 2 columns]
==================================================
                  count  proportion
product_weight_g                   
200.0              7092    0.059947
150.0              5414    0.105710
250.0              4727    0.145666
300.0              4444    0.183230
400.0              3780    0.215181
...                 ...         ...
726.0                 1    0.999966
8575.0                1    0.999975
3598.0                1    0.999983
11025.0               1    0.999992
2676.0                1    1.000000

[2204 rows x 2 columns]
==================================================
                   count  proportion
product_length_cm                   
16.0               18420    0.155699
20.0               10976    0.248476
30.0                7954    0.315709
17.0                6208    0.368184
18.0                5902    0.418072
...                  ...         ...
83.0                   8    0.999831
96.0                   8    0.999899
94.0                   6    0.999949
9.0                    4    0.999983
8.0                    2    1.000000

[99 rows x 2 columns]
==================================================
                   count  proportion
product_height_cm                   
10.0               10375    0.087697
20.0                6935    0.146317
15.0                6880    0.204471
12.0                6533    0.259693
11.0                6427    0.314019
...                  ...         ...
98.0                   3    0.999932
92.0                   3    0.999958
94.0                   2    0.999975
97.0                   2    0.999992
89.0                   1    1.000000

[102 rows x 2 columns]
==================================================
                  count  proportion
product_width_cm                   
20.0              12727    0.107578
11.0              11140    0.201741
15.0               9373    0.280969
16.0               8859    0.355851
30.0               8046    0.423862
...                 ...         ...
103.0                 1    0.999966
97.0                  1    0.999975
104.0                 1    0.999983
98.0                  1    0.999992
86.0                  1    1.000000

[95 rows x 2 columns]
==================================================

order_approved_at¶

In [24]:
# 'order_approved_at' 컬럼에서 결측치가 포함된 행만 추출
missing_rows = merge_df[merge_df['order_approved_at'].isna()]

# 'order_status' 컬럼의 value_counts
order_status_counts = missing_rows['order_status'].value_counts()
print("Order Status Counts:")
print(order_status_counts)
Order Status Counts:
order_status
canceled     158
delivered     15
created        5
Name: count, dtype: int64
  • canceled 인 경우 당연하게도 주문이 승인이 되지 않기 때문에 결측치가 발생하고 있었다.
  • delivered 됐는데 승인이 되지 않은 경우가 15건 있는데, 시스템 오류로 발생한 결측치로 보는 것이 타당하다고 생각.
  • created 또한 마찬가지로 시스템 오류로 보는 것이 합당할 듯

결론: order_approved_at 의 결측치는 178개인데, 전체 데이터에 비하면 그 비율이 너무 작기에 그것을 알아내는 비용이 이유를 알아냈을 때 얻을 수 있는 이득보다 더 클 것이라 생각되어 행을 제거하면 어떨까?

order_delivered_timestamp¶

In [25]:
# 'order_approved_at' 컬럼에서 결측치가 포함된 행만 추출
missing_rows = merge_df[merge_df['order_delivered_timestamp'].isna()]

# 'order_status' 컬럼의 value_counts
order_status_counts = missing_rows['order_status'].value_counts()
print("Order Status Counts:")
print(order_status_counts)
Order Status Counts:
order_status
shipped        1257
canceled        745
unavailable     653
processing      376
invoiced        375
delivered         8
created           5
approved          3
Name: count, dtype: int64

order_status가 delivered 가 아닌 경우 당연하게도 배송된 시점의 타임스탬프는 비어있어야한다. 문제는 delivered 됐는데도 타임스탬프가 비어있는 경우이다.

  • 하지만 그 개수가 8개로 행 제거.
  • 그리고 나머지 status에 대해서 결측치가 있는 부분은 전자상거래 도메인에 대한 이해가 필요할거 같다.

결론: delivered를 제외한 나머지 상태에서는 아직 도착하지 않았으니 비어있는게 당연하다. 그리고 결측치 개수가 3,422개로 전체 데이터와 비교하면 약 3%정도 된다. 따라서 비어있는 부분은 그대로 두고 나머지 데이터들에 대해서만 분석을 진행해도 되지 않을까?

order_item_id ~ shipping_charges¶

  • order_item_id, product_id, seller_id, price, shipping_charges
In [26]:
# columns_to_check에 대한 결측치들이 모두 835개로 서로 관계가 있을 것이라 생각해 같이 살펴보기로 함.


# 체크할 컬럼들 정의
columns_to_check = ['order_item_id', 'product_id', 'seller_id', 'price', 'shipping_charges']

# 각 컬럼에 대해 결측치가 포함된 행들의 'order_status'에 대한 value_counts 계산
for column in columns_to_check:
    # 각 컬럼에 대해 결측치가 있는 행만 필터링
    missing_rows = merge_df[merge_df[column].isna()]
    
    # 해당 행들의 'order_status' value_counts 계산
    order_status_counts = missing_rows['order_status'].value_counts()
    
    # 결과 출력
    print(f"Order Status Value Counts for Rows with Missing '{column}':")
    print(order_status_counts)
    print("-" * 50)  # 구분선 출력
Order Status Value Counts for Rows with Missing 'order_item_id':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'product_id':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'seller_id':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'price':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'shipping_charges':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------

역시 columns_to_check에 해당하는 모든 컬럼들은 order_status의 영향을 받고 있었다.

  • 그 중에서도 총 835개 중 646개가 unavailable에 속했다.
    • unavailable에 속할 경우를 아래와 같이 살펴보면 order_purchase_timestamp, order_approved_at 에는 값이 있는 것으로 보아 판매자 측에서 재고 부족 등의 문제로 취소를 한 경우에 해당한다고 생각.
  • canceled도 살펴봤을 때 order_purchase_timestamp, order_approved_at에 값이 있는 것으로 보아 본인이 취소한 경우에 속할 것이다.

결론: 이러한 결과들로 살펴봤을 때 본 프로젝트에서는 유효한 주문에 대해서만 분석을 진행하기 위해 835개의 행을 모두 제거하는 것이 좋지 않을까 생각

In [27]:
merge_df[merge_df['order_status'] == 'unavailable'].head(3)
Out[27]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
320 8e24261a7e58791d10cb1bf9da94df5c 41fc647b8c6bd979b1b6364b60471b50 unavailable 11/16/2017 15:09 11/16/2017 15:26 NaN 12/5/2017 0:00 89288 sao bento do sul SC ... NaN 1.0 credit_card 5.0 84.00 NaN NaN NaN NaN NaN
701 c272bcd21c287498b4883c7512019702 0e634b16e4c585acbd7b2e8276ce6677 unavailable 1/31/2018 11:31 1/31/2018 14:23 NaN 2/16/2018 0:00 11701 praia grande SP ... NaN 1.0 credit_card 1.0 97.68 NaN NaN NaN NaN NaN
823 37553832a3a89c9b2db59701c357ca67 596ed6d7a35890b3fbac54ec01f69685 unavailable 8/14/2017 17:38 8/17/2017 0:15 NaN 9/5/2017 0:00 2318 sao paulo SP ... NaN 1.0 wallet 1.0 132.46 NaN NaN NaN NaN NaN

3 rows × 24 columns

In [28]:
merge_df[merge_df['order_status'] == 'canceled'].head(3)
Out[28]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
471 1b9ecfe83cdc259250e1a8aca174f0ad 8ea097b1824dbd4d17af71b0afe04301 canceled 8/4/2018 14:29 8/7/2018 4:10 NaN 8/14/2018 0:00 5514 sao paulo SP ... 8.34 1.0 wallet 1.0 33.34 toys 350.0 22.0 11.0 17.0
737 714fb133a6730ab81fa1d3c1b2007291 c4ebedb09beb89cc0314c5c0c33f8053 canceled 1/26/2018 21:34 1/26/2018 21:58 NaN 2/22/2018 0:00 9961 diadema SP ... 26.11 1.0 credit_card 4.0 96.01 toys 12300.0 40.0 40.0 40.0
1267 3a129877493c8189c59c60eb71d97c29 11c3d6c93dea6ce86769a4835c171bd9 canceled 1/25/2018 13:34 1/25/2018 13:50 NaN 2/23/2018 0:00 87013 maringa PR ... 15.10 1.0 credit_card 1.0 14.12 toys 329.0 20.0 16.0 18.0

3 rows × 24 columns

payment_sequential ~ payment_value¶

  • payment_sequential, payment_type, payment_installments, payment_value
In [29]:
# columns_to_check2에 대한 결측치들이 모두 3개로 서로 관계가 있을 것이라 생각해 같이 살펴보기로 함.

columns_to_check2 = ['payment_sequential', 'payment_type', 'payment_installments', 'payment_value']

merge_df[merge_df[columns_to_check2].isna().any(axis=1)]
Out[29]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm
36884 bfbd0f9bdef84302105ad712db648a6c 830d5b7aaa3b6f1e9ad63703bec97d23 delivered 9/15/2016 12:16 9/15/2016 12:16 11/9/2016 7:47 10/4/2016 0:00 14600 sao joaquim da barra SP ... 2.83 NaN NaN NaN NaN toys 1000.0 16.0 16.0 16.0
36885 bfbd0f9bdef84302105ad712db648a6c 830d5b7aaa3b6f1e9ad63703bec97d23 delivered 9/15/2016 12:16 9/15/2016 12:16 11/9/2016 7:47 10/4/2016 0:00 14600 sao joaquim da barra SP ... 2.83 NaN NaN NaN NaN toys 1000.0 16.0 16.0 16.0
36886 bfbd0f9bdef84302105ad712db648a6c 830d5b7aaa3b6f1e9ad63703bec97d23 delivered 9/15/2016 12:16 9/15/2016 12:16 11/9/2016 7:47 10/4/2016 0:00 14600 sao joaquim da barra SP ... 2.83 NaN NaN NaN NaN toys 1000.0 16.0 16.0 16.0

3 rows × 24 columns

데이터를 살펴봤을 때 완전히 동일한 데이터이다.

결론: 행 제거

product_category_name¶

In [30]:
# product_category_name이 결측치인 행의 product_id 추출

product_ids = merge_df[merge_df['product_category_name'].isna()]['product_id'].tolist()
product_ids
Out[30]:
['71225f49be70df4297892f6a5fa62171',
 '9820e04e332fc450d6fd975befc1bc28',
 nan,
 nan,
 nan,
 '0ae10ca52e6cb8745e17c11e7c43907c',
 nan,
 nan,
 nan,
 'c4d92c32b2e02363d3f087c7156a523c',
 '45a15b38cc3c0514717a1de673c6193c',
 nan,
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '0105b5323d24fc655f73052694dbbb3a',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '15f773b4fea5b2ab681b02620b0c3a95',
 '15f773b4fea5b2ab681b02620b0c3a95',
 nan,
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '6627f494b45486e462e2bb460fcb18e5',
 'f58e45b16a42a325c144eb2c46a2bc57',
 '06ddfdf210c7e0259854ee543215088d',
 'b61d1388a17e3f547d2bc218df02335b',
 'fcde1899043318107d8f0ae043f69bc3',
 nan,
 nan,
 nan,
 nan,
 nan,
 '817a68e75e8fc94fa28915d8169d67bf',
 nan,
 nan,
 'c7858457bc43561ccb96aaa2932cb085',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '6ad6cc284cdcc1cdd2f06eef8d983913',
 '56139431d72cd51f19eb9f7dae4d1617',
 nan,
 nan,
 nan,
 nan,
 nan,
 '52a5346c9dae9cd560b103875b94d123',
 'c230b471b7e21ff9060e68ee154afd70',
 'a08ffa6ecdd0ab3c1f4fe348acc2553b',
 nan,
 '40a1e3c65a0bcf6f4ebba840a8156ba2',
 nan,
 nan,
 'b20dff2518b593f3092fb7b3ba13588e',
 'b20dff2518b593f3092fb7b3ba13588e',
 nan,
 '45a15b38cc3c0514717a1de673c6193c',
 '45a15b38cc3c0514717a1de673c6193c',
 'c230b471b7e21ff9060e68ee154afd70',
 nan,
 nan,
 '629beb8e7317703dcc5f35b5463fd20e',
 '73c5d3186138770b5ae53055adf10ad9',
 '73c5d3186138770b5ae53055adf10ad9',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 nan,
 nan,
 nan,
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 nan,
 'f0ea71b6e2ab4cb3bd8f5ba522a25a56',
 nan,
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 'a8cb589d0aa5e6308a8fcf133ecca711',
 nan,
 nan,
 nan,
 'cae7a908351ea0b7acb0086a51f6f565',
 '5d923ead886c44b86845f69e50520c3e',
 nan,
 nan,
 'ea8062a20ca055f459f3a99eccdec4b5',
 nan,
 nan,
 nan,
 '7af3e2da474486a3519b0cba9dea8ad9',
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 'be59f0b9bd1d08d096188f98168fc987',
 'a8e59319e3c44b5af3a5412d713af5bb',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '3abf2d4698bf245577543ea01d9c7f16',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'e891d4a9622cae3b9fc2ec558bda155b',
 'e891d4a9622cae3b9fc2ec558bda155b',
 '46b48281eb6d663ced748f324108c733',
 'bcb815bba008d89458e428078c0b9211',
 nan,
 nan,
 '0103863bf3441460142ec23c74388e4c',
 '0105b5323d24fc655f73052694dbbb3a',
 '0105b5323d24fc655f73052694dbbb3a',
 nan,
 nan,
 nan,
 nan,
 '0b2f52bff0f8a4536a315bbdc34d8398',
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 '83f86eca5473b6d1c7d566223592908d',
 nan,
 nan,
 'f0ea71b6e2ab4cb3bd8f5ba522a25a56',
 nan,
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'e891d4a9622cae3b9fc2ec558bda155b',
 '45a15b38cc3c0514717a1de673c6193c',
 nan,
 'd7b39598905ba8e791fe3f1f376d59f9',
 nan,
 '7e005c651eec410691559257f9ef8f64',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '1fde73e6b6d28627ff16e222b0044b10',
 '4257f45f6ea58aa46885ed7ef723b6c0',
 nan,
 nan,
 nan,
 'c230b471b7e21ff9060e68ee154afd70',
 nan,
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 '3846c262ddb08db0ea2cc1a1685590ff',
 nan,
 'e76828e87d341935c7d4e9f26420cda0',
 '7ce49c8945602088c4c4ca3e0ceb8244',
 nan,
 '9820e04e332fc450d6fd975befc1bc28',
 nan,
 nan,
 nan,
 nan,
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 'e891d4a9622cae3b9fc2ec558bda155b',
 'e891d4a9622cae3b9fc2ec558bda155b',
 'e891d4a9622cae3b9fc2ec558bda155b',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '3846c262ddb08db0ea2cc1a1685590ff',
 nan,
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 'fbd2a9aa1f115244e2ce115ad8eacfca',
 '629beb8e7317703dcc5f35b5463fd20e',
 nan,
 nan,
 nan,
 'c230b471b7e21ff9060e68ee154afd70',
 nan,
 nan,
 nan,
 nan,
 nan,
 'e275b12a1775bcc17af5524af5ba80b4',
 nan,
 nan,
 nan,
 'fcde1899043318107d8f0ae043f69bc3',
 '0c877471e35049402bca235e06cb8063',
 '0c877471e35049402bca235e06cb8063',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'aaa391b7c35a628a6318851e8e933e0b',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'b5778c55de68777a0a3565e03b2e2817',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '6b7879a37ac2dbe5289a16706e859870',
 nan,
 nan,
 nan,
 nan,
 'c7858457bc43561ccb96aaa2932cb085',
 nan,
 'db1d1a4569fdbae28a0d3340d907f7af',
 nan,
 'fcde1899043318107d8f0ae043f69bc3',
 nan,
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 '1574ed2c73e4465c572f68dd77528203',
 nan,
 'b607bf5bab84cb38111d2b6a10164385',
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 '0502d1a36be75bd36b452f31c6ed264a',
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 '6d0af4d8b482cc38d72acafc37bdd8f1',
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 '46b48281eb6d663ced748f324108c733',
 '46b48281eb6d663ced748f324108c733',
 nan,
 nan,
 nan,
 nan,
 '1a2ece70966710dec441aa587993b5b4',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'ca26080d8f0fa848c329fe732f5e34e8',
 nan,
 '05cf9ac595f28386ee763c98cbc2bad0',
 nan,
 nan,
 nan,
 'c68b419d9c6038271b85bac98adb0fc9',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '6fd83eb3e0799b775e4f946bd66657c0',
 '6fd83eb3e0799b775e4f946bd66657c0',
 '08a59a7c557b1b8d49e5029a337d83fd',
 '0502d1a36be75bd36b452f31c6ed264a',
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 '45a15b38cc3c0514717a1de673c6193c',
 'c321936f167cdd14866ac591eea36f38',
 nan,
 nan,
 'c230b471b7e21ff9060e68ee154afd70',
 'c230b471b7e21ff9060e68ee154afd70',
 'c230b471b7e21ff9060e68ee154afd70',
 'c230b471b7e21ff9060e68ee154afd70',
 nan,
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 nan,
 nan,
 nan,
 nan,
 '51fe26bc4bb0f89f07096d8c96b35476',
 nan,
 nan,
 'db1d1a4569fdbae28a0d3340d907f7af',
 'db1d1a4569fdbae28a0d3340d907f7af',
 nan,
 nan,
 nan,
 nan,
 '9d09d82a6838777302175f209dcfc4ec',
 nan,
 nan,
 '9ce62e73b4ea1a6c74b68ba57aa1333b',
 '9ce62e73b4ea1a6c74b68ba57aa1333b',
 '9ce62e73b4ea1a6c74b68ba57aa1333b',
 '9ce62e73b4ea1a6c74b68ba57aa1333b',
 nan,
 'ed000a9f9f04376551e24de254464c9e',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '62fde58e97724f8b7519f3789eefa33f',
 '62fde58e97724f8b7519f3789eefa33f',
 '62fde58e97724f8b7519f3789eefa33f',
 '62fde58e97724f8b7519f3789eefa33f',
 '62fde58e97724f8b7519f3789eefa33f',
 '62fde58e97724f8b7519f3789eefa33f',
 'b9adb2d8377c195da59e75743dbb7133',
 'c7858457bc43561ccb96aaa2932cb085',
 '73df0070248804951b5c6ba9537c6fc3',
 '73df0070248804951b5c6ba9537c6fc3',
 '73df0070248804951b5c6ba9537c6fc3',
 '73df0070248804951b5c6ba9537c6fc3',
 '73df0070248804951b5c6ba9537c6fc3',
 '73df0070248804951b5c6ba9537c6fc3',
 nan,
 nan,
 'a41e356c76fab66334f36de622ecbd3a',
 nan,
 nan,
 'c230b471b7e21ff9060e68ee154afd70',
 'c230b471b7e21ff9060e68ee154afd70',
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 nan,
 nan,
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 nan,
 '1f5542855786adc0116f56cfd6f9f830',
 nan,
 nan,
 nan,
 'b607bf5bab84cb38111d2b6a10164385',
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 '94c07760f470ab9997f27d0129682eb3',
 '94c07760f470ab9997f27d0129682eb3',
 nan,
 'b20dff2518b593f3092fb7b3ba13588e',
 nan,
 nan,
 nan,
 nan,
 nan,
 '12ad6d9a5da40908f6741145e1a93443',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '3d103df80d5d22d709244fe27bc29a8d',
 '3d103df80d5d22d709244fe27bc29a8d',
 '3d103df80d5d22d709244fe27bc29a8d',
 '3d103df80d5d22d709244fe27bc29a8d',
 nan,
 '3846c262ddb08db0ea2cc1a1685590ff',
 nan,
 nan,
 nan,
 nan,
 'e948cf75e6b64307f7f76a49eaa9fc2f',
 'beec9bc2305d3372fa3b3133ba3fd644',
 '2f282449dadb2e59ff2afcde0bd9f393',
 nan,
 nan,
 '46b48281eb6d663ced748f324108c733',
 nan,
 '0105b5323d24fc655f73052694dbbb3a',
 nan,
 'b3f3d2afe706552a90fd99f870e4ddf5',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '52a5346c9dae9cd560b103875b94d123',
 nan,
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 '0ed27993e016d4546e2038bfd75b812c',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'd1802b79d2e79841cc2cadedd8d4e6a3',
 'd1802b79d2e79841cc2cadedd8d4e6a3',
 '39e3b9b12cd0bf8ee681bbc1c130feb5',
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 nan,
 nan,
 '3a78f64aac654298e4b9aff32fc21818',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '671446e8e3aa3df1eca47b6c354a2921',
 '671446e8e3aa3df1eca47b6c354a2921',
 nan,
 '1a1dba78177cbabf00a2e1eab46e62b6',
 nan,
 nan,
 nan,
 nan,
 nan,
 'a2d0e33d71561d8116565565a6f68578',
 'a2d0e33d71561d8116565565a6f68578',
 '52a5346c9dae9cd560b103875b94d123',
 nan,
 '62fde58e97724f8b7519f3789eefa33f',
 nan,
 nan,
 '50f39d42ee3c1155a39f330b0b53c735',
 '0502d1a36be75bd36b452f31c6ed264a',
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 nan,
 '8c7274c4e638e7f229007fe28aafc741',
 nan,
 '17b9f231d057710012fe28e5a1ac5186',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 '39fda099b9edc21f85a80d14021579b5',
 '0105b5323d24fc655f73052694dbbb3a',
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 nan,
 '8d39a63db6a98e40571f486e0c711cf7',
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 nan,
 nan,
 nan,
 nan,
 '9df852a6c839bc3202227666d70c7cae',
 nan,
 nan,
 nan,
 nan,
 nan,
 '45a15b38cc3c0514717a1de673c6193c',
 nan,
 nan,
 nan,
 'be59f0b9bd1d08d096188f98168fc987',
 nan,
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'eef398550dd9575d23eb2d8be74bd45a',
 nan,
 'e891d4a9622cae3b9fc2ec558bda155b',
 nan,
 nan,
 nan,
 nan,
 'd64615580eb89e5d9a6ae0e37135a5fe',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'b3f3d2afe706552a90fd99f870e4ddf5',
 'b3f3d2afe706552a90fd99f870e4ddf5',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '86047e935e01608082bdae92e41a11db',
 nan,
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 '6b82874c6b51b92913dcdb364eaaae0f',
 nan,
 nan,
 nan,
 nan,
 'b3f3d2afe706552a90fd99f870e4ddf5',
 'b3f3d2afe706552a90fd99f870e4ddf5',
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 '96cadd6d137bc9726fe1fae279d48631',
 '5f4bceafafc1903556735cb7f2842e78',
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 nan,
 '1f5542855786adc0116f56cfd6f9f830',
 nan,
 '46255a421f244cd985f138a63d77d09c',
 nan,
 nan,
 'd6bcc9649c52a826c656563ffb87a036',
 nan,
 '270e70a55f9a0917f86b37cb32afcddd',
 nan,
 'dc9f66a56b89e4278d921898a861a29a',
 nan,
 nan,
 '645fd9f79f14fad6e8a79fbc94dae528',
 nan,
 nan,
 nan,
 '0ed27993e016d4546e2038bfd75b812c',
 '0ed27993e016d4546e2038bfd75b812c',
 nan,
 nan,
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 nan,
 'fcde1899043318107d8f0ae043f69bc3',
 nan,
 nan,
 nan,
 nan,
 'cf1c83c0097bce11c715b94f292769b5',
 nan,
 '0105b5323d24fc655f73052694dbbb3a',
 '0105b5323d24fc655f73052694dbbb3a',
 nan,
 nan,
 nan,
 nan,
 'a8cb589d0aa5e6308a8fcf133ecca711',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '46255a421f244cd985f138a63d77d09c',
 '46255a421f244cd985f138a63d77d09c',
 nan,
 nan,
 nan,
 nan,
 nan,
 '242a6abd32d7630b48ef5c0a1bf527eb',
 '242a6abd32d7630b48ef5c0a1bf527eb',
 '242a6abd32d7630b48ef5c0a1bf527eb',
 '242a6abd32d7630b48ef5c0a1bf527eb',
 '242a6abd32d7630b48ef5c0a1bf527eb',
 nan,
 nan,
 '3373d1ed7bf5f3187f110c0e9e40a02b',
 nan,
 nan,
 '9df852a6c839bc3202227666d70c7cae',
 nan,
 nan,
 '194163a8a65a48a8c95623c399a9d4a7',
 nan,
 nan,
 nan,
 'b61d1388a17e3f547d2bc218df02335b',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'bc0cf6f75f7db177ba3d47677f5e467f',
 nan,
 nan,
 '1f5542855786adc0116f56cfd6f9f830',
 nan,
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 nan,
 nan,
 '46b48281eb6d663ced748f324108c733',
 nan,
 '5fb61f482620cb672f5e586bb132eae9',
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 nan,
 nan,
 'e10758160da97891c2fdcbc35f0f031d',
 nan,
 nan,
 nan,
 'e891d4a9622cae3b9fc2ec558bda155b',
 'cae7a908351ea0b7acb0086a51f6f565',
 nan,
 '794de06c32a626a5692ff50e4985d36f',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'eef398550dd9575d23eb2d8be74bd45a',
 nan,
 'b01cedfa96d891427058852f0625e9ee',
 'b01cedfa96d891427058852f0625e9ee',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '09337f4d317bae01dfa96e6d368eafc4',
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 'ea11e700a343582ad56e4c70e966cb36',
 nan,
 'ab1741a86f20ab3aca7dd2c347d5a9e3',
 '0533c5ec3a042521493cfb6e12d6befd',
 '0533c5ec3a042521493cfb6e12d6befd',
 nan,
 nan,
 '0502d1a36be75bd36b452f31c6ed264a',
 '0502d1a36be75bd36b452f31c6ed264a',
 '0502d1a36be75bd36b452f31c6ed264a',
 '519693b7e29749ed7b6ad60cb07b9400',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '0a7a5763f6ee3b862ca27208f8fbbdb7',
 nan,
 'fedccbd5e370e8ddb7aae6fb4cb70347',
 'ad113e2e3b77e66ef6ecfd7386cd13ac',
 nan,
 'c230b471b7e21ff9060e68ee154afd70',
 'c230b471b7e21ff9060e68ee154afd70',
 'c230b471b7e21ff9060e68ee154afd70',
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 'e891d4a9622cae3b9fc2ec558bda155b',
 '73c5d3186138770b5ae53055adf10ad9',
 '73c5d3186138770b5ae53055adf10ad9',
 '73c5d3186138770b5ae53055adf10ad9',
 'b5d80a1b72c94ff284760dabb4fb776b',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'e49a57a6afd465b7d3cd109daf09ca98',
 'e49a57a6afd465b7d3cd109daf09ca98',
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '194163a8a65a48a8c95623c399a9d4a7',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 'be59f0b9bd1d08d096188f98168fc987',
 '0502d1a36be75bd36b452f31c6ed264a',
 '31dbb0d1815bdc83c93f4b3472fa7c4d',
 nan,
 '212cc0fa7359ab242a697a03a574f719',
 nan,
 'f9b1795281ce51b1cf39ef6d101ae8ab',
 nan,
 '39e3b9b12cd0bf8ee681bbc1c130feb5',
 '3e6092cf0f56b9c78bcbd539157b9f70',
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 nan,
 '3846c262ddb08db0ea2cc1a1685590ff',
 nan,
 nan,
 'fcde1899043318107d8f0ae043f69bc3',
 nan,
 nan,
 'b17a4e7b997ea645dd323bb1bed802ac',
 nan,
 nan,
 '1fde73e6b6d28627ff16e222b0044b10',
 '9ce62e73b4ea1a6c74b68ba57aa1333b',
 'b61d1388a17e3f547d2bc218df02335b',
 nan,
 'b02a3f9824b29932f36a74d811f1b14c',
 '7285e977b36a2243203636b240772c5e',
 nan,
 nan,
 nan,
 nan,
 '194163a8a65a48a8c95623c399a9d4a7',
 nan,
 nan,
 nan,
 nan,
 nan,
 '6b7879a37ac2dbe5289a16706e859870',
 nan,
 '86047e935e01608082bdae92e41a11db',
 '86047e935e01608082bdae92e41a11db',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '014fcf6bd5cd4c7ee29fb3bb618c445e',
 nan,
 'c230b471b7e21ff9060e68ee154afd70',
 nan,
 nan,
 '3e6092cf0f56b9c78bcbd539157b9f70',
 nan,
 nan,
 nan,
 'e275b12a1775bcc17af5524af5ba80b4',
 nan,
 nan,
 '0c877471e35049402bca235e06cb8063',
 '0c877471e35049402bca235e06cb8063',
 nan,
 nan,
 '6868e3f08c6acd37b48c785aedddfdb7',
 nan,
 nan,
 nan,
 nan,
 nan,
 '6b7879a37ac2dbe5289a16706e859870',
 'd7b39598905ba8e791fe3f1f376d59f9',
 'd7b39598905ba8e791fe3f1f376d59f9',
 'd7b39598905ba8e791fe3f1f376d59f9',
 'd7b39598905ba8e791fe3f1f376d59f9',
 'd7b39598905ba8e791fe3f1f376d59f9',
 'd7b39598905ba8e791fe3f1f376d59f9',
 nan,
 nan,
 nan,
 '17823ffd2de8234f0e885a71109613a4',
 nan,
 '3f6f946481fd39f4eda986012f6e0447',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '194163a8a65a48a8c95623c399a9d4a7',
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 '82758517d6c9cc65c509781f1e14e9f2',
 '82758517d6c9cc65c509781f1e14e9f2',
 '82758517d6c9cc65c509781f1e14e9f2',
 '82758517d6c9cc65c509781f1e14e9f2',
 nan,
 nan,
 nan,
 'eef398550dd9575d23eb2d8be74bd45a',
 '3846c262ddb08db0ea2cc1a1685590ff',
 nan,
 '5455dbe20e64b08413e0f2c451e31ab9',
 '9f69acd4da62618a3f6365b732d00ccd',
 'c7858457bc43561ccb96aaa2932cb085',
 nan,
 nan,
 '41f1e5a38185a9ae6090a552b712ba43',
 '46255a421f244cd985f138a63d77d09c',
 nan,
 nan,
 nan,
 nan,
 nan,
 'f58e45b16a42a325c144eb2c46a2bc57',
 nan,
 'a9d8f831888820cd641dcd5ca9fea4e8',
 nan,
 ...]
In [31]:
# product_ids 리스트에 있는 product_id들 중에서, product_category_name이 결측치가 아닌 행들을 필터링

filtered_df = merge_df[merge_df['product_id'].isin(product_ids) & merge_df['product_category_name'].notna()]
filtered_df
Out[31]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... shipping_charges payment_sequential payment_type payment_installments payment_value product_category_name product_weight_g product_length_cm product_height_cm product_width_cm

0 rows × 24 columns

product_category_name에서 결측치를 포함하는 데이터의 product_id를 리스트로 뽑았다. 이 product_id 들에 대한 다른 행의 product_category_name 중 결측치가 아닌 것이 있을 수도 있지 않을까?

  • 확인 결과 없었음.

결론: 해당 product_category_name은 아마 이제 더이상 판매하지 않는 제품들인거 같으므로, 삭제하는 것이 마땅하다고 생각.

product_weight_g ~ product_width_cm¶

  • product_weight_g, product_length_cm, product_height_cm, product_width_cm
  • 해당 컬럼들에 대해서는 전체 데이터 개수 대비 0.71%로 매우 작기 때문에 중앙값으로 대체
    • 극단값의 영향을 줄이기 위해

최종 결측치 처리하는 코드¶

In [32]:
# 각 컬럼의 중앙값으로 결측치를 채우기 [product_weight_g, product_length_cm, product_height_cm, product_width_cm]
merge_df['product_weight_g'].fillna(merge_df['product_weight_g'].median(), inplace=True)
merge_df['product_length_cm'].fillna(merge_df['product_length_cm'].median(), inplace=True)
merge_df['product_height_cm'].fillna(merge_df['product_height_cm'].median(), inplace=True)
merge_df['product_width_cm'].fillna(merge_df['product_width_cm'].median(), inplace=True)

# 결측치를 제거할 열 목록
columns_to_drop_null= [
    'order_approved_at', 
    'order_item_id', 
    'product_id', 
    'seller_id', 
    'price', 
    'shipping_charges', 
    'payment_sequential', 
    'payment_type', 
    'payment_installments', 
    'payment_value', 
    'product_category_name'
]

# 지정된 열들에서 결측치가 있는 행을 제거
merge_df = merge_df.dropna(subset=columns_to_drop_null)

# 분석 프로젝트에 필요 없다고 판단한 컬럼 제거
columns_to_drop = ['payment_sequential']
merge_df = merge_df.drop(columns=columns_to_drop)

merge_df.isna().sum()
Out[32]:
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                   0
order_delivered_timestamp        2566
order_estimated_delivery_date       0
customer_zip_code_prefix            0
customer_city                       0
customer_state                      0
order_item_id                       0
product_id                          0
seller_id                           0
price                               0
shipping_charges                    0
payment_type                        0
payment_installments                0
payment_value                       0
product_category_name               0
product_weight_g                    0
product_length_cm                   0
product_height_cm                   0
product_width_cm                    0
dtype: int64

order_delivered_timestamp를 제외한 나머지 컬럼들 결측치 처리 완료

3-2. 이상치 처리¶

위에서 살펴봤던 것처럼 모든 수치형 변수들에 이상치가 있는 것으로 확인되어 이상치 처리가 필요한 상황이다.

In [33]:
numerical_df.describe(include='all')
Out[33]:
order_item_id price shipping_charges payment_sequential payment_installments payment_value product_weight_g product_length_cm product_height_cm product_width_cm
count 118325.000000 118325.000000 118325.000000 119157.000000 119157.000000 119157.000000 118305.000000 118305.000000 118305.000000 118305.000000
mean 1.196569 120.729200 20.037466 1.094178 2.941053 172.825878 2113.957364 30.254537 16.633143 23.066692
std 0.699144 184.130919 15.847918 0.726675 2.777111 267.813648 3789.273735 16.189991 13.453618 11.743808
min 1.000000 0.850000 0.000000 1.000000 0.000000 0.000000 0.000000 7.000000 2.000000 6.000000
25% 1.000000 39.900000 13.080000 1.000000 1.000000 60.890000 300.000000 18.000000 8.000000 15.000000
50% 1.000000 74.900000 16.280000 1.000000 2.000000 108.210000 700.000000 25.000000 13.000000 20.000000
75% 1.000000 134.900000 21.190000 1.000000 4.000000 189.260000 1800.000000 38.000000 20.000000 30.000000
max 21.000000 6735.000000 409.680000 29.000000 24.000000 13664.080000 40425.000000 105.000000 105.000000 118.000000
In [34]:
# 수치형 변수에 대한 value_counts

numerical_cols = ['order_item_id', 'price', 'shipping_charges', 'payment_installments', 'payment_value',
                           'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
numerical_df = merge_df[numerical_cols]

for col in numerical_df.columns:
    print(numerical_df[col].value_counts().to_frame().join(numerical_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
                count  proportion
order_item_id                    
1.0            103276    0.876148
2.0             10268    0.963258
3.0              2388    0.983516
4.0               991    0.991924
5.0               471    0.995919
6.0               262    0.998142
7.0                62    0.998668
8.0                36    0.998973
9.0                29    0.999220
10.0               26    0.999440
11.0               17    0.999584
12.0               13    0.999695
13.0                8    0.999762
14.0                7    0.999822
15.0                5    0.999864
16.0                3    0.999890
17.0                3    0.999915
18.0                3    0.999941
19.0                3    0.999966
20.0                3    0.999992
21.0                1    1.000000
==================================================
         count  proportion
price                     
59.90     2623    0.022252
69.90     2115    0.040195
49.90     2035    0.057459
89.90     1628    0.071270
99.90     1522    0.084182
...        ...         ...
21.94        1    0.999966
869.99       1    0.999975
1019.49      1    0.999983
698.00       1    0.999992
213.39       1    1.000000

[5944 rows x 2 columns]
==================================================
                  count  proportion
shipping_charges                   
15.10              3838    0.032560
7.78               2340    0.052411
14.10              1976    0.069175
11.85              1966    0.085854
18.23              1627    0.099656
...                 ...         ...
53.02                 1    0.999966
55.84                 1    0.999975
39.37                 1    0.999983
49.03                 1    0.999992
0.62                  1    1.000000

[6992 rows x 2 columns]
==================================================
                      count  proportion
payment_installments                   
1.0                   58681    0.497824
2.0                   13750    0.614473
3.0                   11778    0.714392
4.0                    8000    0.782261
10.0                   6906    0.840848
5.0                    6040    0.892089
8.0                    5078    0.935169
6.0                    4631    0.974456
7.0                    1841    0.990074
9.0                     736    0.996318
12.0                    164    0.997709
15.0                     94    0.998507
18.0                     38    0.998829
24.0                     34    0.999118
11.0                     25    0.999330
20.0                     21    0.999508
13.0                     18    0.999661
14.0                     16    0.999796
17.0                      7    0.999856
16.0                      7    0.999915
21.0                      5    0.999958
0.0                       3    0.999983
23.0                      1    0.999992
22.0                      1    1.000000
==================================================
               count  proportion
payment_value                   
50.00            347    0.002944
100.00           300    0.005489
20.00            288    0.007932
77.57            251    0.010062
35.00            164    0.011453
...              ...         ...
526.67             1    0.999966
680.82             1    0.999975
438.81             1    0.999983
561.56             1    0.999992
281.43             1    1.000000

[28878 rows x 2 columns]
==================================================
                  count  proportion
product_weight_g                   
200.0              7023    0.059580
150.0              5407    0.105451
250.0              4719    0.145485
300.0              4417    0.182957
400.0              3733    0.214626
...                 ...         ...
7034.0                1    0.999966
361.0                 1    0.999975
9167.0                1    0.999983
13805.0               1    0.999992
2676.0                1    1.000000

[2203 rows x 2 columns]
==================================================
                   count  proportion
product_length_cm                   
16.0               18305    0.155292
20.0               10965    0.248314
30.0                7934    0.315622
17.0                6198    0.368204
18.0                5888    0.418155
...                  ...         ...
83.0                   8    0.999830
96.0                   8    0.999898
94.0                   6    0.999949
9.0                    4    0.999983
8.0                    2    1.000000

[99 rows x 2 columns]
==================================================
                   count  proportion
product_height_cm                   
10.0               10358    0.087873
20.0                6905    0.146452
15.0                6858    0.204632
12.0                6526    0.259996
11.0                6422    0.314477
...                  ...         ...
98.0                   3    0.999932
92.0                   3    0.999958
94.0                   2    0.999975
97.0                   2    0.999992
89.0                   1    1.000000

[102 rows x 2 columns]
==================================================
                  count  proportion
product_width_cm                   
20.0              12715    0.107869
11.0              11019    0.201349
15.0               9329    0.280492
16.0               8850    0.355572
30.0               8023    0.423635
...                 ...         ...
103.0                 1    0.999966
97.0                  1    0.999975
104.0                 1    0.999983
98.0                  1    0.999992
86.0                  1    1.000000

[95 rows x 2 columns]
==================================================

order_item_id¶

order_item_id에 대한 value_counts를 살펴보면

  • 1일 때의 개수가 103,276개로 누적합 약 87.6%를 차지하고 있다.
  • 또한 2일 때의 개수가 10,268개로 전체의 약 10%에 해당한다.

결론: 따라서, 1개를 구매한 경우와 2개 이상을 구매한 경우로 이진화하는 방법으로 범주형 변수로 변환하는 것이 나을 것이라 생각

  • 1개를 구매한 경우 1, 2개 이상을 구매한 경우를 0으로 변환
  • 이진화를 할 경우 불균형 문제를 어느정도 해결할 수 있을 것.
In [35]:
def binarize(df, column_name):
    df[column_name] = df[column_name].apply(lambda x: 1 if x == 1 else 0)
    return df

price¶

우측으로 꼬리가 긴 right-skewed 분포를 가지기 때문에 모델링에 부정적인 영향을 미칠 수 있다.

결론: 따라서, right-skewed 분포를 가져 극단값이 클러스터링 결과에 큰 영향을 미칠 수 있는 이러한 경우에 사용하면 효과적인 IQR 방법으로 이상치 제거.

In [36]:
def remove_outliers_iqr(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df_cleaned = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    
    return df_cleaned

payment_sequential¶

결론: payment_sequential의 경우 누적합 95%의 값이 1에 집중되어 있다. 따라서, 빈도수가 낮은 값들이 데이터의 왜곡을 초래할 수 있기 때문에, 빈도 기반 필터링을 통해 빈도수가 적은 결제 순서 값을 제거.

In [37]:
def remove_low_frequency_outliers(df, column_name, min_frequency=0.01):
    value_counts = df[column_name].value_counts()
    proportions = value_counts / len(df)
    
    valid_values = proportions[proportions >= min_frequency].index
    
    df_cleaned = df[df[column_name].isin(valid_values)]
    
    return df_cleaned

payment_installments¶

결론: payment_installments는 1개월부터 12개월까지 거의 순서대로 나열되어 있으며, 누적합 액 99.8%를 차지한다.

  • 따라서, 12개월 그 이상의 할부 개월은 이상치로 보고 필터링.
In [38]:
def remove_outliers_by_threshold(df, column_name, threshold):
    df_cleaned = df[df[column_name] <= threshold]
    
    return df_cleaned

payment_value¶

결론: price와 연관이 있는 변수로 동일하게 IQR 방법으로 제거.

shipping_charges¶

결론: IQR 방법으로 처리

product_weight_g, product_length_cm, product_height_cm, product_width_cm¶

결론: 위에서 boxplot을 살펴봤을 때 IQR이 적절해보임

최종 이상치 처리하는 코드¶

In [39]:
# 변수 이진화 함수
def binarize(df, column_name):
    df[column_name] = df[column_name].apply(lambda x: 1 if x == 1 else 0)
    return df

# IQR 제거 방식
def remove_outliers_iqr(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df_cleaned = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    
    return df_cleaned

# 빈도수 기반 필터링 함수
def remove_low_frequency_outliers(df, column_name, min_frequency=0.01):
    value_counts = df[column_name].value_counts()
    proportions = value_counts / len(df)
    
    valid_values = proportions[proportions >= min_frequency].index
    
    df_cleaned = df[df[column_name].isin(valid_values)]
    
    return df_cleaned

def remove_outliers_by_threshold(df, column_name, threshold):
    df_cleaned = df[df[column_name] <= threshold]
    
    return df_cleaned


# order_item_id 1개는 1로, 2개 이상은 0으로 이진화하는 코드
merged_df_cleaned = binarize(merge_df, 'order_item_id')

# price, shipping_charges, payment_value IQR 방법으로 처리
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'price')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'shipping_charges')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'payment_value')

# payment_installments 12개월 초과는 필터링
merged_df_cleaned = remove_outliers_by_threshold(merged_df_cleaned, 'payment_installments', 12)

# product_weight_g, product_length_cm, product_height_cm, product_width_cm IQR 방법으로 처리
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_weight_g')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_length_cm')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_height_cm')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_width_cm')
In [40]:
# 이상치, 결측치 처리한 데이터

merged_df_cleaned.to_csv('merged_df_cleaned.csv', index=False)
merged_df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 77871 entries, 0 to 119159
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   order_id                       77871 non-null  object 
 1   customer_id                    77871 non-null  object 
 2   order_status                   77871 non-null  object 
 3   order_purchase_timestamp       77871 non-null  object 
 4   order_approved_at              77871 non-null  object 
 5   order_delivered_timestamp      76297 non-null  object 
 6   order_estimated_delivery_date  77871 non-null  object 
 7   customer_zip_code_prefix       77871 non-null  int64  
 8   customer_city                  77871 non-null  object 
 9   customer_state                 77871 non-null  object 
 10  order_item_id                  77871 non-null  int64  
 11  product_id                     77871 non-null  object 
 12  seller_id                      77871 non-null  object 
 13  price                          77871 non-null  float64
 14  shipping_charges               77871 non-null  float64
 15  payment_type                   77871 non-null  object 
 16  payment_installments           77871 non-null  float64
 17  payment_value                  77871 non-null  float64
 18  product_category_name          77871 non-null  object 
 19  product_weight_g               77871 non-null  float64
 20  product_length_cm              77871 non-null  float64
 21  product_height_cm              77871 non-null  float64
 22  product_width_cm               77871 non-null  float64
dtypes: float64(8), int64(2), object(13)
memory usage: 14.3+ MB
In [41]:
plt.figure(figsize=(20, 10))

for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    sns.boxplot(y=merged_df_cleaned[col])  
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col) 
    plt.ylabel('Value')

plt.tight_layout()
plt.show()
No description has been provided for this image

3-3 변수 스케일링¶

3-3-1. 수치형 변수 스케일링¶

현재 수치형 변수가 오른쪽으로 꼬리가 긴 right-skewed 분포를 가지고 있다. 따라서, 로그변환 방법으로 변수들을 조금 더 정규분포에 더 가깝게 만들어 비대칭성을 줄인다.

이후 정규화(Normalization)을 통해 데이터의 범위를 일정한 구간으로 변환하여 모델의 학습 속도와 성능을 높이고자 한다.

  • KNN과 같은 거리 기반 알고리즘에서는 변수 간의 값의 범위가 일치하지 않으면 거리 계산에서 문제가 발생할 수 있는데, 정규화를 통해 해결 가능.

'product_length_cm', 'product_height_cm', 'product_width_cm'¶

세 개의 컬럼을 곱해서 부피 컬럼을 만들고 기존 3 개의 컬럼은 드랍.

In [42]:
# 'product_length_cm', 'product_height_cm', 'product_width_cm' 곱해서 'volume' 컬럼 생성
merged_df_cleaned['volume'] = merged_df_cleaned['product_length_cm'] * merged_df_cleaned['product_height_cm'] * merged_df_cleaned['product_width_cm']

# 기존의 세 개의 컬럼 드랍
merged_df_cleaned = merged_df_cleaned.drop(columns=['product_length_cm', 'product_height_cm', 'product_width_cm'])

merged_df_cleaned.head(10)
Out[42]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date customer_zip_code_prefix customer_city customer_state ... product_id seller_id price shipping_charges payment_type payment_installments payment_value product_category_name product_weight_g volume
0 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP ... 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 29.99 8.72 credit_card 1.0 18.12 housewares 500.0 1976.0
1 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP ... 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 29.99 8.72 voucher 1.0 2.00 housewares 500.0 1976.0
2 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff delivered 10/2/2017 10:56 10/2/2017 11:07 10/10/2017 21:25 10/18/2017 0:00 3149 sao paulo SP ... 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 29.99 8.72 voucher 1.0 18.59 housewares 500.0 1976.0
3 53cdb2fc8bc7dce0b6741e2150273451 af07308b275d755c9edb36a90c618231 delivered 7/24/2018 20:41 7/26/2018 3:24 8/7/2018 15:27 8/13/2018 0:00 47813 barreiras BA ... 595fac2a385ac33a80bd5114aec74eb8 289cdb325fb7e7f891c38608bf9e0962 118.70 22.76 wallet 1.0 141.46 toys 400.0 4693.0
4 47770eb9100c2d0c44946d9cf07ec65d 3a653a41f6f9fc3d2a113cf8398680e8 delivered 8/8/2018 8:38 8/8/2018 8:55 8/17/2018 18:06 9/4/2018 0:00 75265 vianopolis GO ... aa4383b373c6aca5d8797843e5594415 4869f7a5dfa277a7dca6462dcf3b52b2 159.90 19.22 credit_card 3.0 179.12 toys 420.0 9576.0
5 949d5b44dbf5de918fe9c16f97b45f8a 7c142cf63193a1473d2e66489a9ae977 delivered 11/18/2017 19:28 11/18/2017 19:45 12/2/2017 0:28 12/15/2017 0:00 59296 sao goncalo do amarante RN ... d0b61bfb1de832b15ba9d266ca96e5b0 66922902710d126a0e7d26b0e3805106 45.00 27.20 credit_card 1.0 72.20 toys 450.0 6000.0
6 ad21c59c0840e6cb83a9ceb5573f8159 72632f0f9dd73dfee390c9b22eb56dd6 delivered 2/13/2018 21:18 2/13/2018 22:20 2/16/2018 18:17 2/26/2018 0:00 9195 santo andre SP ... 65266b2da20d04dbe00c5c2d3bb7859e 2c9e548be18521d1c43cde1c582c6de8 19.90 8.72 credit_card 1.0 28.62 toys 250.0 11475.0
9 6514b8ad8028c9f2cc2374ded245783f 932afa1e708222e5821dac9cd5db4cae delivered 5/16/2017 13:10 5/16/2017 13:22 5/26/2017 12:55 6/7/2017 0:00 26525 nilopolis RJ ... 4520766ec412348b8d4caa5e8a18c464 16090f2ca825584b5a147ab24aa30c86 59.99 15.17 credit_card 3.0 75.16 toys 50.0 4352.0
15 34513ce0c4fab462a55830c0989c7edb 782987b81c92239d922aa49d6bd4200b delivered 7/13/2017 19:58 7/13/2017 20:10 7/19/2017 14:04 8/8/2017 0:00 4278 sao paulo SP ... f7e0fa615b386bc9a8b9eb52bc1fff76 87142160b41353c4e5fca2360caf6f92 98.00 16.13 credit_card 1.0 114.13 toys 325.0 9261.0
16 82566a660a982b15fb86e904c8d32918 e97109680b052ee858d93a539597bba7 delivered 6/7/2018 10:06 6/9/2018 3:13 6/19/2018 12:05 7/18/2018 0:00 35400 ouro preto MG ... 72a97c271b2e429974398f46b93ae530 094ced053e257ae8cae57205592d6712 31.90 18.23 wallet 1.0 50.13 toys 450.0 4352.0

10 rows × 21 columns

고객별 재구매 횟수, 구매 총 상품 금액, 구매 평균 금액, 총 결제액, 평균 결제 액¶

In [43]:
merged_df_cleaned['repeat_order'] = merged_df_cleaned.groupby('customer_id')['order_id'].transform('nunique')
merged_df_cleaned['total_price'] = merged_df_cleaned.groupby('customer_id')['price'].transform('sum')
merged_df_cleaned['avg_price'] = merged_df_cleaned.groupby('customer_id')['price'].transform('mean')
merged_df_cleaned['total_payment_value'] = merged_df_cleaned.groupby('customer_id')['payment_value'].transform('sum')
merged_df_cleaned['avg_payment_value'] = merged_df_cleaned.groupby('customer_id')['payment_value'].transform('mean')
In [44]:
# 수치형, 범주형 변수 재정의

numerical_cols = ['price', 'shipping_charges', 'payment_value', 'product_weight_g', 'volume']
categorical_cols = ['order_status','customer_zip_code_prefix', 'customer_city', 'customer_state',
                    'order_item_id', 'payment_type', 'payment_installments', 'product_category_name']
In [45]:
# 수치형 변수 로그변환 함수

def log_transform(df, cols):
    df_log_transformed = df[cols].apply(lambda x: np.log1p(x))
    
    return df_log_transformed

def normalize(df, cols):
    scaler = MinMaxScaler()
    df_normalized = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)
    
    return df_normalized

# 로그 변환
log_transformed_df = log_transform(merged_df_cleaned, numerical_cols)

# 정규화 수행
normalized_df = normalize(log_transformed_df, numerical_cols)

# 결과를 원래 데이터프레임에 반영
merged_df_cleaned[numerical_cols] = normalized_df
In [46]:
# 변수 스케일링 후 수치형 kdeplot()

plt.figure(figsize=(20, 10))

for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    ax = sns.kdeplot(merged_df_cleaned[col], shade=True)
    plt.title(f'KDE Plot of {col}')

plt.tight_layout()  
plt.show()
No description has been provided for this image

3-4. 범주형 변수 인코딩¶

One-Hot-Encoding : 각 카테고리를 0과 1로 구성된 벡터로 표현하는 기법. 카테고리의 수만큼 벡터가 생성되므로 각 카테고리가 새로운 변수가 되어 표현된다. 다만, 카테고리가 너무 많은 변수의 경우 데이터의 cardinality를 증가시켜 모델의 성능을 저하시킬 수 있다는 단점.

Label-Encoding : n개의 범주형 데이터를 0 ~ n-1의 연속적인 수치 데이터로 표현. 데이터의 범주가 3개 이상일 때는 주의해서 사용. 라벨 인코딩은 한번 실행시킬 때 단 하나의 컬럼만 실행 가능.

우리 데이터에는 범주형 변수가 많고, 예를들어 product_category_name에는 70개의 카테고리를 가지고 있기 때문에, 원-핫 인코딩과 라벨 인코딩을 사용하는 데는 한계가 있을 것으로 사료된다. 따라서, 다음의 두 가지를 고려한 인코딩 방법을 생각해야 할 것이다.

  • 카디널리티 문제 : 원-핫 인코딩의 경우, 카테고리의 수가 많을수록 차원이 증가하여 데이터가 희소해지고, 계산 비용이 높아지며, 모델 성능이 저하될 수 있다.
  • 거리 측정 문제: 라벨 인코딩은 범주형 데이터를 순서가 있는 수치 데이터로 변환하기 때문에, KNN과 같은 거리 기반 모델에서는 인코딩된 값 간의 인위적인 거리 차이가 생겨 잘못된 결과를 초래할 수 있다.

우리의 선택 : Label-Encoding

  • 범주형 데이터가 많기는 하지만, 그 카테고리를 줄인 후 레이블 인코딩을 하고자 한다.

추가적으로

  • customer_zip_code_prefix, customer_city, customer_state와 같이 위치 정보를 나타내는 컬럼들은 따로 빼서 5개의 지방으로 묶는 과정이 필요.
    • 참고링크 : https://ko.wikipedia.org/wiki/%EB%B8%8C%EB%9D%BC%EC%A7%88%EC%9D%98_%EC%A3%BC
  • 장난감도 카테고리가 70개나 되기 때문에 묶어서 줄이는 과정이 필요.
  • 날짜 변수들 order_purchase_timestamp, order_approved_at, order_delivered_timestamp, order_estimated_delivery_date은 dtype을 datetime으로 변경해 향후 분석에 더욱 용이하게 사용할 수 있도록 한다.
  • order_status, payment_type 레이블 인코딩

customer_state, customer_zip_code_prefix, customer_city¶

In [47]:
# state로 5개의 지방으로 묶기

state_to_region = {
    'AC': '북부 지방', 'AL': '북동부 지방', 'AP': '북부 지방', 'AM': '북부 지방',
    'RR': '북부 지방', 'RO': '북부 지방', 'PA': '북부 지방', 'PB': '북동부 지방',
    'MA': '북동부 지방', 'PI': '북동부 지방', 'PE': '북동부 지방', 'RN': '북동부 지방',
    'CE': '북동부 지방', 'SE': '북동부 지방', 'BA': '북동부 지방', 'DF': '중서부 지방',
    'TO': '북부 지방', 'GO': '중서부 지방', 'MS': '중서부 지방', 'MT': '중서부 지방',
    'RJ': '남동부 지방', 'SP': '남동부 지방', 'MG': '남동부 지방', 'ES': '남동부 지방',
    'RS': '남부 지방', 'SC': '남부 지방', 'PR': '남부 지방'
}

# state 컬럼을 기준으로 지방으로 변환
merged_df_cleaned['region'] = merged_df_cleaned['customer_state'].map(state_to_region)

# customer_zip_code_prefix, customer_city, customer_state 컬럼 drop
columns_to_drop = ['customer_zip_code_prefix', 'customer_city', 'customer_state']
merged_df_cleaned = merged_df_cleaned.drop(columns=columns_to_drop)
In [48]:
# 결과 확인
merged_df_cleaned['region'].value_counts()
Out[48]:
region
남동부 지방    55872
남부 지방     11141
북동부 지방     5471
중서부 지방     4463
북부 지방       924
Name: count, dtype: int64

region 맵핑:

  • 남동부 지방 : 0
  • 남부 지방 : 1
  • 북동부 지방 : 2
  • 북부 지방 : 3
  • 중서부 지방 : 4
In [49]:
# 결과 확인했으니 이제 레이블 인코딩

label_encoder = LabelEncoder()

# 'region' 컬럼을 레이블 인코딩
merged_df_cleaned['region'] = label_encoder.fit_transform(merged_df_cleaned['region'])
merged_df_cleaned['region'].value_counts()
Out[49]:
region
0    55872
1    11141
2     5471
4     4463
3      924
Name: count, dtype: int64

product_category_name¶

In [50]:
# product_category_name 종류별로 묶어서 카테고리 줄이기

category_mapping = {
    '가구/인테리어': [
        'furniture_decor', 'furniture_living_room', 'furniture_bedroom', 
        'furniture_mattress_and_upholstery', 'kitchen_dining_laundry_garden_furniture', 
        'la_cuisine', 'flowers', 'cool_stuff', 'perfumery', 'party_supplies', 
        'bed_bath_table', 'market_place', 'home_construction', 'christmas_supplies'
    ],
    '패션': [
        'fashion_underwear_beach', 'fashion_bags_accessories', 'fashion_shoes', 
        'fashion_male_clothing', 'fashion_sport', 'fashion_childrens_clothes', 
        'fashio_female_clothing', 'housewares', 'watches_gifts'
    ],
    '전자제품': [
        'telephony', 'computers_accessories', 'audio', 'tablets_printing_image', 
        'cine_photo', 'musical_instruments', 'consoles_games', 'dvds_blu_ray', 
        'music', 'electronics', 'air_conditioning', 'small_appliances', 
        'home_appliances', 'home_appliances_2', 'small_appliances_home_oven_and_coffee', 
        'home_comfort_2', 'signaling_and_security', 'security_and_services', 
        'fixed_telephony'
    ],
    '건설/공구': [
        'construction_tools_construction', 'costruction_tools_garden', 
        'construction_tools_safety', 'construction_tools_lights', 
        'costruction_tools_tools', 'garden_tools'
    ],
    '생활용품': [
        'baby', 'diapers_and_hygiene', 'health_beauty', 'home_confort', 
        'luggage_accessories', 'auto', 'food', 'drinks', 'food_drink', 
        'sports_leisure', 'pet_shop', 'agro_industry_and_commerce'
    ],
    '문구/사무용품': [
        'stationery', 'office_furniture', 'books_technical', 
        'books_general_interest', 'books_imported', 'arts_and_craftmanship', 
        'art', 'industry_commerce_and_business'
    ],
    '장난감': ['toys']
}

# 카테고리 매핑을 수행하는 함수
def map_category(category_name):
    for main_category, subcategories in category_mapping.items():
        if category_name in subcategories:
            return main_category
    return '기타'  

# `product_category_name` 컬럼을 매핑하여 새로운 컬럼 추가
merged_df_cleaned['product_category_group'] = merged_df_cleaned['product_category_name'].apply(map_category)

# 'product_category_name' 컬럼 삭제
merged_df_cleaned = merged_df_cleaned.drop(columns='product_category_name')

merged_df_cleaned['product_category_group'].value_counts()
Out[50]:
product_category_group
장난감        58189
생활용품        6108
가구/인테리어     4968
전자제품        4164
패션          2730
건설/공구       1107
문구/사무용품      605
Name: count, dtype: int64

product_category_group 인코딩:

  • 가구/인테리어: 0
  • 건설/공구: 1
  • 문구/사무용품: 2
  • 생활용품: 3
  • 장난감: 4
  • 전자제품: 5
  • 패션: 6
In [51]:
# product_category_name 잘 맵핑 됐으니 이제 레이블 인코딩

label_encoder = LabelEncoder()

# 'product_category_group' 컬럼을 레이블 인코딩
merged_df_cleaned['product_category_group'] = label_encoder.fit_transform(merged_df_cleaned['product_category_group'])

# 변환된 데이터 확인
merged_df_cleaned['product_category_group'].value_counts()
Out[51]:
product_category_group
4    58189
3     6108
0     4968
5     4164
6     2730
1     1107
2      605
Name: count, dtype: int64

order_purchase_timestamp, order_approved_at, order_delivered_timestamp, order_estimated_delivery_date¶

In [52]:
# 범주형 변수들 처리 전 날짜 변수들의 dtype datetime으로 변경.
date_columns = ['order_purchase_timestamp', 'order_approved_at', 
                'order_delivered_timestamp', 'order_estimated_delivery_date']

for col in date_columns:
    merged_df_cleaned[col] = pd.to_datetime(merged_df_cleaned[col])

merged_df_cleaned[date_columns].dtypes
Out[52]:
order_purchase_timestamp         datetime64[ns]
order_approved_at                datetime64[ns]
order_delivered_timestamp        datetime64[ns]
order_estimated_delivery_date    datetime64[ns]
dtype: object

order_status, payment_type¶

  • 'delivered': 0, 'shipped': 1, 'canceled': 2, 'unavailable': 3, 'processing': 4, 'invoiced': 5, 'created': 6, 'approved': 7

  • 'credit_card': 0, 'wallet': 1, 'voucher': 2, 'debit_card': 3, 'not_defined': 4

In [53]:
# order_status, payment_type 인코딩

# 매핑 딕셔너리 정의
order_status_mapping = {
    'delivered': 0,
    'shipped': 1,
    'canceled': 2,
    'unavailable': 3,
    'processing': 4,
    'invoiced': 5,
    'created': 6,
    'approved': 7
}

payment_type_mapping = {
    'credit_card': 0,
    'wallet': 1,
    'voucher': 2,
    'debit_card': 3,
    'not_defined': 4
}

# 직접 레이블 인코딩 수행
merged_df_cleaned['order_status_encoded'] = merged_df_cleaned['order_status'].map(order_status_mapping)
merged_df_cleaned['payment_type_encoded'] = merged_df_cleaned['payment_type'].map(payment_type_mapping)

# 원본 컬럼 제거 (선택 사항)
merged_df_cleaned = merged_df_cleaned.drop(columns=['order_status', 'payment_type'])
In [54]:
merged_df_cleaned.iloc[:, :10].head(3)
Out[54]:
order_id customer_id order_purchase_timestamp order_approved_at order_delivered_timestamp order_estimated_delivery_date order_item_id product_id seller_id price
0 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff 2017-10-02 10:56:00 2017-10-02 11:07:00 2017-10-10 21:25:00 2017-10-18 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 0.562297
1 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff 2017-10-02 10:56:00 2017-10-02 11:07:00 2017-10-10 21:25:00 2017-10-18 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 0.562297
2 e481f51cbdc54678b7cc49136f2d6af7 7c396fd4830fd04220f754e42b4e5bff 2017-10-02 10:56:00 2017-10-02 11:07:00 2017-10-10 21:25:00 2017-10-18 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 0.562297
In [55]:
merged_df_cleaned.iloc[:, 12:].head(3)
Out[55]:
payment_value product_weight_g volume repeat_order total_price avg_price total_payment_value avg_payment_value region product_category_group order_status_encoded payment_type_encoded
0 0.516399 0.772611 0.418826 2 125.36 31.34 82.82 20.705 0 6 0 0
1 0.192265 0.772611 0.418826 2 125.36 31.34 82.82 20.705 0 6 0 2
2 0.520649 0.772611 0.418826 2 125.36 31.34 82.82 20.705 0 6 0 2
In [56]:
merged_df_cleaned.to_csv('merged_df_pre_processed.csv', index=False)

merged_df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 77871 entries, 0 to 119159
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       77871 non-null  object        
 1   customer_id                    77871 non-null  object        
 2   order_purchase_timestamp       77871 non-null  datetime64[ns]
 3   order_approved_at              77871 non-null  datetime64[ns]
 4   order_delivered_timestamp      76297 non-null  datetime64[ns]
 5   order_estimated_delivery_date  77871 non-null  datetime64[ns]
 6   order_item_id                  77871 non-null  int64         
 7   product_id                     77871 non-null  object        
 8   seller_id                      77871 non-null  object        
 9   price                          77871 non-null  float64       
 10  shipping_charges               77871 non-null  float64       
 11  payment_installments           77871 non-null  float64       
 12  payment_value                  77871 non-null  float64       
 13  product_weight_g               77871 non-null  float64       
 14  volume                         77871 non-null  float64       
 15  repeat_order                   77871 non-null  int64         
 16  total_price                    77871 non-null  float64       
 17  avg_price                      77871 non-null  float64       
 18  total_payment_value            77871 non-null  float64       
 19  avg_payment_value              77871 non-null  float64       
 20  region                         77871 non-null  int32         
 21  product_category_group         77871 non-null  int32         
 22  order_status_encoded           77871 non-null  int64         
 23  payment_type_encoded           77871 non-null  int64         
dtypes: datetime64[ns](4), float64(10), int32(2), int64(4), object(4)
memory usage: 14.3+ MB

4. 군집 모델 실행¶

In [58]:
cluster_1 = merged_df_cleaned[['repeat_order','total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']]
In [62]:
from sklearn.cluster import KMeans

# 군집 2개로 나누기
custoemr_kmeans2 = KMeans(n_clusters=2, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans2.fit(cluster_1)

cluster_1['cluster2'] = custoemr_kmeans2.labels_

# 군집 3개로 나누기
custoemr_kmeans3 = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans3.fit(cluster_1)

cluster_1['cluster3'] = custoemr_kmeans3.labels_

# 군집 4개로 나누기
custoemr_kmeans4 = KMeans(n_clusters=4, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans4.fit(cluster_1)

cluster_1['cluster4'] = custoemr_kmeans4.labels_
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\1661325313.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_1['cluster2'] = custoemr_kmeans2.labels_
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\1661325313.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_1['cluster3'] = custoemr_kmeans3.labels_
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\1661325313.py:19: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cluster_1['cluster4'] = custoemr_kmeans4.labels_
In [63]:
# 실루엣 계수

from sklearn.metrics import silhouette_score

labels = custoemr_kmeans2.fit_predict(cluster_1)

silhouette_1 = silhouette_score(cluster_1, custoemr_kmeans2.labels_)
print(f'클러스터 개수 2개일 때: Silhouette Score = {silhouette_1:.6f}')

from sklearn.metrics import silhouette_score

labels = custoemr_kmeans3.fit_predict(cluster_1)

silhouette_2 = silhouette_score(cluster_1, custoemr_kmeans3.labels_)
print(f'클러스터 개수 3개일 때: Silhouette Score = {silhouette_2:.6f}')

from sklearn.metrics import silhouette_score

labels = custoemr_kmeans4.fit_predict(cluster_1)

silhouette_3 = silhouette_score(cluster_1, custoemr_kmeans4.labels_)
print(f'클러스터 개수 4개일 때: Silhouette Score = {silhouette_3:.6f}')
클러스터 개수 2개일 때: Silhouette Score = 0.765187
클러스터 개수 3개일 때: Silhouette Score = 0.575237
클러스터 개수 4개일 때: Silhouette Score = 0.559589
In [64]:
wcss_2 = custoemr_kmeans2.inertia_
print(f'클러스터 개수 3개일 때: WCSS = {wcss_2:.4f}')


wcss_3 = custoemr_kmeans3.inertia_
print(f'클러스터 개수 3개일 때: WCSS = {wcss_3:.4f}')

wcss_4 = custoemr_kmeans4.inertia_
print(f'클러스터 개수 3개일 때: WCSS = {wcss_4:.4f}')
클러스터 개수 3개일 때: WCSS = 3528533448.0313
클러스터 개수 3개일 때: WCSS = 2497989545.3095
클러스터 개수 3개일 때: WCSS = 1931635348.2375
In [65]:
plt.figure(figsize=(30,10))
plt.subplot(1,3,1)
sns.scatterplot(data=cluster_1, x='total_price', y='total_payment_value', hue='cluster2')
plt.subplot(1,3,2)
sns.scatterplot(data=cluster_1, x='total_price', y='total_payment_value', hue='cluster3')
plt.subplot(1,3,3)
sns.scatterplot(data=cluster_1, x='total_price', y='total_payment_value', hue='cluster4')
Out[65]:
<Axes: xlabel='total_price', ylabel='total_payment_value'>
No description has been provided for this image

4-1. 엘보우 포인트 확인하기¶

In [66]:
def elbow(df):
    sse = []
    for i in range(1,15):
        km = KMeans(n_clusters= i, init='k-means++', random_state=42)
        km.fit(df)
        sse.append(km.inertia_)
    
    plt.plot(range(1,15), sse, marker = 'o')
    plt.xlabel('cluster count')
    plt.ylabel('SSE')
    plt.show()

elbow(cluster_1)
No description has been provided for this image

4-2. 컬럼 별 상관관계 확인¶

In [67]:
import seaborn as sns
import matplotlib.pyplot as plt

# 각 열의 값들 사이의 상관관계 행렬을 구합니다.
correlation_matrix = cluster_1.corr()

# 상관관계 행렬을 바탕으로 히트맵을 그립니다.
sns.heatmap(data=correlation_matrix, annot=True, cmap='RdYlGn')

plt.show()
No description has been provided for this image

5. 군집 결과 해석¶

5-1. 군집 1¶

In [72]:
# 원본 자료는 살리기 위해 복사하기
merged_clu_df = merged_df_cleaned.copy()

# 클러스트 항목 설정
cluster_1 = merged_df_cleaned[['repeat_order','total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']]

# 학습하기
# 군집 3개로 나누기
custoemr_kmeans3 = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans3.fit(cluster_1)

# 군집된 결과 저장
merged_clu_df['cluster'] = custoemr_kmeans3.labels_


# 컬럼 지우는 함수
def del_cols(df):
    del df['order_id']
    del df['customer_id']
    del df['seller_id']
    del df['product_id']
    del df['order_approved_at']
    del df['order_delivered_timestamp']
    del df['order_estimated_delivery_date']
    
# 군집 특성 확인하는 함수
def character_visual(df):
    plt.figure(figsize=(20,20))

    for i in range(len(df.columns)):
    
        cols = list(df.columns)[i]
    
        plt.subplot(4,6,i+1)
        sns.histplot(df, x=cols, palette='RdYlGn')
        plt.title(cols)
        
In [73]:
# 1번 군집만 설정
merged_clu1_df = merged_clu_df[merged_clu_df['cluster'] == 0]

del_cols(merged_clu1_df)
character_visual(merged_clu1_df)
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
No description has been provided for this image

5-2. 군집 2¶

In [74]:
merged_clu2_df = merged_clu_df[merged_clu_df['cluster'] == 1]

del_cols(merged_clu2_df)
character_visual(merged_clu2_df)
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
No description has been provided for this image

5-2. 군집 3¶

In [75]:
merged_clu3_df = merged_clu_df[merged_clu_df['cluster'] == 2]

del_cols(merged_clu3_df)
character_visual(merged_clu3_df)
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
C:\Users\Sungmin Kang\AppData\Local\Temp\ipykernel_2536\614600598.py:35: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  sns.histplot(df, x=cols, palette='RdYlGn')
No description has been provided for this image

6. 추가 시각화¶

In [82]:
import seaborn as sns
import matplotlib.pyplot as plt

merged_df = merged_clu_df
merged_df['order_purchase_timestamp'] = pd.to_datetime(merged_df['order_purchase_timestamp'])
merged_df['quarter'] = merged_df['order_purchase_timestamp'].dt.to_period('Q')

# 모든 subplot에서 동일한 x축 레이블을 사용하기 위해 고유한 quarter 값을 얻음
unique_quarters = sorted(merged_df['quarter'].unique())

# 군집별로 quarter의 분포 시각화
plt.figure(figsize=(12, 6))

for cluster in range(3):  # 3개의 군집이 있으므로 0, 1, 2로 루프
    plt.subplot(1, 3, cluster + 1)
    sns.countplot(data=merged_df[merged_df['cluster'] == cluster], x='quarter', palette='RdYlGn')
    plt.title(f'Cluster {cluster} Quarter Distribution')
    plt.xlabel('Quarter')
    plt.ylabel('Count')
    plt.xticks(ticks=range(len(unique_quarters)), labels=unique_quarters, rotation=45)  # 동일한 x축 레이블 설정

plt.tight_layout()
plt.show()
No description has been provided for this image
In [84]:
import pandas as pd

# 분기별 군집
rows = []

for quarter in sorted(merged_df['quarter'].unique()):
    row = {
        'Quarter': quarter,
        'Cluster 0 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 0)].shape[0],
        'Cluster 1 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 1)].shape[0],
        'Cluster 2 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 2)].shape[0]
    }
    rows.append(row)

# 데이터프레임으로 변환
summary_df = pd.DataFrame(rows)
print(summary_df)
  Quarter  Cluster 0 Count  Cluster 1 Count  Cluster 2 Count
0  2016Q3                1                0                0
1  2016Q4              167                4               71
2  2017Q1             3094               90              809
3  2017Q2             5523              128             1616
4  2017Q3             7309              226             2355
5  2017Q4            10413              244             3522
6  2018Q1            12451              353             3980
7  2018Q2            11397              341             3560
8  2018Q3             7855              164             2198
In [85]:
import pandas as pd
import matplotlib.pyplot as plt

# 데이터프레임 만들기
summary_df = pd.DataFrame(rows)

# 분기 열 str로 타입 변경
summary_df['Quarter'] = summary_df['Quarter'].astype(str)

# 시각화
plt.figure(figsize=(12, 6))

plt.plot(summary_df['Quarter'], summary_df['Cluster 0 Count'], marker='o', label='Cluster 0')
plt.plot(summary_df['Quarter'], summary_df['Cluster 1 Count'], marker='o', label='Cluster 1')
plt.plot(summary_df['Quarter'], summary_df['Cluster 2 Count'], marker='o', label='Cluster 2')

plt.xlabel('Quarter')
plt.ylabel('Count')
plt.title('Quarterly Distribution of Orders by Cluster')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [88]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.font_manager as fm

plt.rcParams['font.family'] = 'AppleGothic'

daily_payment = df.groupby(df['order_purchase_timestamp'].dt.date)['payment_value'].mean()

plt.figure(figsize=(14, 7))

plt.plot(daily_payment.index, daily_payment.values, label='Average Daily Payment Value', color='skyblue')

plt.scatter(daily_payment.index, daily_payment.values, color='red', s=10, label='Payment Value Points')

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gcf().autofmt_xdate()

plt.title('Change in monthly average sales value')
plt.xlabel('Date')
plt.ylabel('Average Payment Value')
plt.legend()
plt.grid(True)
plt.show()
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 50900 (\N{HANGUL SYLLABLE WEOL}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 48324 (\N{HANGUL SYLLABLE BYEOL}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 54217 (\N{HANGUL SYLLABLE PYEONG}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 44512 (\N{HANGUL SYLLABLE GYUN}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 54032 (\N{HANGUL SYLLABLE PAN}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 47588 (\N{HANGUL SYLLABLE MAE}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 44032 (\N{HANGUL SYLLABLE GA}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 52824 (\N{HANGUL SYLLABLE CI}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 51032 (\N{HANGUL SYLLABLE YI}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 48320 (\N{HANGUL SYLLABLE BYEON}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
c:\Users\Sungmin Kang\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:170: UserWarning: Glyph 54868 (\N{HANGUL SYLLABLE HWA}) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
findfont: Font family 'AppleGothic' not found.
No description has been provided for this image